TY - CONF
T1 - Hephaestus
T2 - 7th Biennial Conference on Innovative Data Systems Research, CIDR 2015
AU - Duggan, Jennie
AU - Brodie, Michael L.
N1 - Funding Information:
The authors thank their colleagues for their insightful feedback, including Sam Madden, Eugene Wu, Michael Stonebraker, Herb Lin, Jaime Carbonell, Gregory Piatetsky-Shapiro, and Eli Upfal. We are also grateful to the anonymous reviewers of this paper whose comments guided many important improvements. We also acknowledge the many scientists who have so generously taken the time to answer our questions on how scientific research is conducted, including Pete Szolovits, Pardis Sabeti, James Michaelson, and Tanya Monro. This work was funded by the Intel Science and Technology Center for Big Data.
Publisher Copyright:
© 2015 Conference on Innovative Data Systems Research (CIDR). All rights reserved.
PY - 2015/1/1
Y1 - 2015/1/1
N2 - Data-intensive science, wherein domain experts use big data analytics in the course of their research, is becoming increasingly common in the physical and social sciences. Moreover, data reuse is becoming the new normal, owing to the open data movement [15] and arrival of big science experiments such as the Large Hadron Collider. Here, a small group of researchers with exotic equipment produce a dataset that is shared by thousands. Unfortunately, weak and spurious correlations are also on the rise in research [5, 27]. For example, Google Flu Trends published their algorithms in 2008 [19] for use in public health, and in the intervening time its accuracy has plummeted. In the 2011-2012 flu season, this system produced estimates more than 50% higher than the number of cases reported by the U.S. Center for Disease Control [32]. This work first examines common pitfalls associated with data-intensive science and how they contribute to irreproducible results. We then propose a system for conducting virtual experiments over existing data. It simulates randomized controlled trials by reframing the principles of empirical research. These virtual experiments underpin a larger platform we call Hephaestus. This framework accumulates virtual experiments in a visualization to help scientists identify consistencies and anomalies in an area of research. We then highlight a set of research challenges associated with this platform. We argue that by using this approach, data-intensive science may come to achieve accuracy on par with its causality-driven predecessors.
UR - http://www.scopus.com/inward/record.url?scp=85084015642&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084015642&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85084015642
Y2 - 4 January 2015 through 7 January 2015
ER -