Hephaestus: Data reuse for accelerating scientific discovery

Jennie Duggan, Michael L. Brodie

Research output: Contribution to conference, Paper, peer-review

9 Scopus citations


Data-intensive science, wherein domain experts use big data analytics in the course of their research, is becoming increasingly common in the physical and social sciences. Moreover, data reuse is becoming the new normal, owing to the open data movement [15] and the arrival of big science experiments such as the Large Hadron Collider. In such experiments, a small group of researchers with exotic equipment produces a dataset that is shared by thousands. Unfortunately, weak and spurious correlations are also on the rise in research [5, 27]. For example, Google Flu Trends published its algorithms in 2008 [19] for use in public health, and its accuracy has since plummeted. In the 2011-2012 flu season, the system produced estimates more than 50% higher than the number of cases reported by the U.S. Centers for Disease Control and Prevention [32]. This work first examines common pitfalls associated with data-intensive science and how they contribute to irreproducible results. We then propose a system for conducting virtual experiments over existing data. It simulates randomized controlled trials by reframing the principles of empirical research. These virtual experiments underpin a larger platform we call Hephaestus. This framework accumulates virtual experiments in a visualization to help scientists identify consistencies and anomalies in an area of research. We then highlight a set of research challenges associated with this platform. We argue that by using this approach, data-intensive science may come to achieve accuracy on par with its causality-driven predecessors.

Original language: English (US)
State: Published - Jan 1 2015
Event: 7th Biennial Conference on Innovative Data Systems Research, CIDR 2015 - Asilomar, United States
Duration: Jan 4 2015 to Jan 7 2015


Conference: 7th Biennial Conference on Innovative Data Systems Research, CIDR 2015
Country/Territory: United States

ASJC Scopus subject areas

  • Information Systems and Management
  • Hardware and Architecture
  • Artificial Intelligence
  • Information Systems

