SciSciNet: A large-scale open data lake for the science of science research

Zihang Lin, Yian Yin, Lu Liu, Dashun Wang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


The science of science has attracted growing research interest, partly due to the increasing availability of large-scale datasets capturing the inner workings of science. These datasets, and the numerous linkages among them, enable researchers to ask a range of fascinating questions about how science works and where innovation occurs. Yet as datasets grow, it becomes increasingly difficult to track available sources and linkages across datasets. Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses. We offer detailed documentation of pre-processing steps and analytical choices in constructing the data lake. We further supplement the data lake by computing frequently used measures in the literature, illustrating how researchers may contribute collectively to enriching the data lake. Overall, this data lake serves as an initial but useful resource for the field, by lowering the barrier to entry, reducing duplication of effort in data processing and measurements, improving the robustness and replicability of empirical claims, and broadening the diversity and representation of ideas in the field.

Original language: English (US)
Article number: 315
Journal: Scientific Data
Issue number: 1
State: Published - Dec 2023

ASJC Scopus subject areas

  • Information Systems
  • Education
  • Library and Information Sciences
  • Statistics and Probability
  • Computer Science Applications
  • Statistics, Probability and Uncertainty

