Sizing the problem of improving discovery and access to NIH-Funded data: A preliminary study

Kevin B. Read, Jerry R. Sheehan, Michael F. Huerta, Lou S. Knecht, James G. Mork, Betsy L. Humphreys, Swapna Abhyankar, Olubumi Akiwumi, Olivier Bodenreider, Sally Davidson, Dina Demner Fushman, Tracy Edinger, Greg Farber, Karen Gutzman, Mary Ann Hantakas, Preeti Kochar, Jennie Larkin, Peter Lyster, Matt McAuliffe, Shari Mohary, Helen Ochej, Olga Printseva, Oleg Rodionov, Laritza Rodriguez, Suzy Roy, Susan Schmidt, Sonya Shooshan, Matthew Simpson, Corinn Sinnot, Samantha Tate, Janice Ward, Melissa Yorks

Research output: Contribution to journal › Article › peer-review

22 Scopus citations


Objective: This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are "invisible" or not deposited in a known repository.

Methods: We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article.

Results: About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects.

Conclusion: In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a "dataset," determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets.
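The arithmetic behind the Results figures can be checked with a short calculation. The sketch below is illustrative only: it uses the rounded numbers quoted in the abstract, assumes the total was obtained as (articles with invisible datasets) × (average datasets per article), and back-calculates an article count that the abstract itself does not state. The function and variable names are hypothetical.

```python
# Rounded figures quoted in the abstract (low and high ends of each range).
AVG_DATASETS_PER_ARTICLE = (2.9, 3.4)     # invisible datasets per article
TOTAL_INVISIBLE_DATASETS = (200_000, 235_000)
SHARE_NEW = 0.87                           # newly collected data
SHARE_REUSED = 0.13                        # reuse of existing data


def implied_article_count(total_datasets, datasets_per_article):
    """Back out the article count, ASSUMING total = articles * per-article average."""
    return total_datasets / datasets_per_article


def split_new_vs_reused(total_datasets):
    """Apply the 87% / 13% split from the abstract to a dataset total."""
    return round(total_datasets * SHARE_NEW), round(total_datasets * SHARE_REUSED)


# Implied number of articles with invisible datasets (not reported in the abstract).
implied_low = implied_article_count(TOTAL_INVISIBLE_DATASETS[0], AVG_DATASETS_PER_ARTICLE[0])
implied_high = implied_article_count(TOTAL_INVISIBLE_DATASETS[1], AVG_DATASETS_PER_ARTICLE[1])

new_low, reused_low = split_new_vs_reused(TOTAL_INVISIBLE_DATASETS[0])
new_high, reused_high = split_new_vs_reused(TOTAL_INVISIBLE_DATASETS[1])

if __name__ == "__main__":
    print(f"Implied articles: {implied_low:,.0f} to {implied_high:,.0f}")
    print(f"Newly collected datasets: {new_low:,} to {new_high:,}")
    print(f"Reused datasets: {reused_low:,} to {reused_high:,}")
```

Under that assumption, both ends of the range imply roughly 69,000 articles with invisible datasets, which suggests the 200,000 to 235,000 spread comes from the per-article average rather than from the article count. Because the inputs are rounded, the result is an approximation and not a figure from the study.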

Original language: English (US)
Article number: e0132735
Journal: PloS one
Issue number: 7
State: Published - Jul 24 2015

ASJC Scopus subject areas

  • General

