Hephaestus: Data Reuse for Accelerating Scientific Discovery

Jennie M Duggan, Michael L Brodie

Research output: Contribution to conference (Paper)

Abstract

Data-intensive science, wherein domain experts use big data analytics in the course of their research, is becoming increasingly common in the physical and social sciences. Moreover, data reuse is becoming the new normal, owing to the open data movement and the arrival of big science experiments such as the Large Hadron Collider. Here, a small group of researchers with exotic equipment produces a dataset that is shared by thousands. Unfortunately, weak and spurious correlations are also on the rise in research. For example, Google Flu Trends published its algorithms in 2008 for use in public health, and in the intervening time its accuracy has plummeted. In the 2011-2012 flu season, this system produced estimates more than 50% higher than the number of cases reported by the U.S. Centers for Disease Control and Prevention.

This work first examines common pitfalls associated with data-intensive science and how they contribute to irreproducible results. We then propose a system for conducting virtual experiments over existing data. It simulates randomized controlled trials by reframing the principles of empirical research. These virtual experiments underpin a larger platform we call Hephaestus. This framework accumulates virtual experiments in a visualization to help scientists identify consistencies and anomalies in an area of research. We then highlight a set of research challenges associated with this platform. We argue that by using this approach, data-intensive science may come to achieve accuracy on par with its causality-driven predecessors.

Original language: English (US)
Number of pages: 12
State: Published - 2015
Event: 7th Biennial Conference on Innovative Data Systems Research - Asilomar, California, United States
Duration: Jan 4 2015 - Jan 7 2015

Conference

Conference: 7th Biennial Conference on Innovative Data Systems Research
Abbreviated title: CIDR '15
Country: United States
City: Asilomar
Period: 1/4/15 - 1/7/15

Fingerprint

experiment
physical science
disease control
visualization
public health
anomaly
science
trial
social science
trend

Cite this

Duggan, J. M., & Brodie, M. L. (2015). Hephaestus: Data Reuse for Accelerating Scientific Discovery. Paper presented at 7th Biennial Conference on Innovative Data Systems Research, Asilomar, United States.