S2AND: A Benchmark and Evaluation System for Author Name Disambiguation

Shivashankar Subramanian, Daniel King, Doug Downey, Sergey Feldman

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Author Name Disambiguation (AND) is the task of resolving which author mentions in a bibliographic database refer to the same real-world person, and is a critical ingredient of digital library applications such as search and citation analysis. While many AND algorithms have been proposed, comparing them is difficult because they often employ distinct features and are evaluated on different datasets. In response to this challenge, we present S2AND, a unified benchmark dataset for AND on scholarly papers, as well as an open-source reference model implementation. Our dataset harmonizes eight disparate AND datasets into a uniform format, with a single rich feature set drawn from the Semantic Scholar (S2) database. Our evaluation suite for S2AND reports performance split by facets like publication year and number of papers, allowing researchers to track both global performance and measures of fairness across facet values. Our experiments show that because previous datasets tend to cover idiosyncratic and biased slices of the literature, algorithms trained to perform well on one of them may generalize poorly to others. By contrast, we show how training on a union of datasets in S2AND results in more robust models that perform well even on datasets unseen in training. The resulting AND model also substantially improves over the production algorithm in S2, reducing error by over 50% in terms of B³ F1. We release our unified dataset, model code, trained models, and evaluation suite to the research community.
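The abstract reports results in terms of B³ F1, a standard clustering metric. As an illustration only, the following is a minimal sketch of how B³ precision, recall, and F1 can be computed when each author mention is assigned to one predicted cluster and one gold cluster. The function name `b_cubed` and the toy mention and cluster labels are hypothetical and are not taken from the S2AND code base, whose implementation may differ.

```python
from collections import defaultdict


def b_cubed(predicted, gold):
    """Return (precision, recall, f1) under the B-cubed metric.

    predicted, gold: dicts mapping each mention id to a cluster label.
    Both dicts must cover the same, non-empty set of mentions.
    """
    if not gold or set(predicted) != set(gold):
        raise ValueError("predicted and gold must cover the same non-empty mentions")

    # Invert the assignments: cluster label -> set of mention ids.
    pred_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for mention, label in predicted.items():
        pred_clusters[label].add(mention)
    for mention, label in gold.items():
        gold_clusters[label].add(mention)

    precision_sum = 0.0
    recall_sum = 0.0
    for mention in gold:
        pred_set = pred_clusters[predicted[mention]]
        gold_set = gold_clusters[gold[mention]]
        overlap = len(pred_set & gold_set)
        # Per-mention precision: share of the predicted cluster that is correct.
        precision_sum += overlap / len(pred_set)
        # Per-mention recall: share of the gold cluster that was recovered.
        recall_sum += overlap / len(gold_set)

    n = len(gold)
    precision = precision_sum / n
    recall = recall_sum / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (hypothetical data): five mentions, two real authors.
gold = {"a": "p1", "b": "p1", "c": "p1", "d": "p2", "e": "p2"}
predicted = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "y"}
precision, recall, f1 = b_cubed(predicted, gold)
```

In this toy case, mention `c` is placed with the wrong author, which lowers both precision and recall: each averages to 11/15 over the five mentions, so F1 is also 11/15.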

Original language: English (US)
Title of host publication: Proceedings - 2021 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Editors: J. Stephen Downie, Dana McKay, Hussein Suleman, David M. Nichols, Faryaneh Poursardar
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 170-179
Number of pages: 10
ISBN (Electronic): 9781665417709
State: Published - 2021
Event: 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021 - Virtual, Online, United States
Duration: Sep 27 2021 to Sep 30 2021

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Volume: 2021-September
ISSN (Print): 1552-5996

Conference

Conference: 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
Country/Territory: United States
City: Virtual, Online
Period: 9/27/21 to 9/30/21

Keywords

  • Author name disambiguation
  • Digital libraries
  • Out-of-domain evaluation

ASJC Scopus subject areas

  • Engineering (all)
