TY - GEN
T1 - S2AND
T2 - 21st ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
AU - Subramanian, Shivashankar
AU - King, Daniel
AU - Downey, Doug
AU - Feldman, Sergey
N1 - Funding Information:
We thank Zejiang Shen, Sonia Murthy, Kyle Lo, and Dan Weld for helpful feedback; Bailey Kuehl and Rodney Kinney for their extensive evaluation; Regan Huff, Jason Dunkelberger, Joanna Power, Angele Zamarron, and Brandon Stilson for their engineering work that turned our Python code into something that scales to 200 million papers; and the entire Semantic Scholar team for creating the underlying data. This work was supported in part by NSF grant OIA-2033558.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Author Name Disambiguation (AND) is the task of resolving which author mentions in a bibliographic database refer to the same real-world person, and is a critical ingredient of digital library applications such as search and citation analysis. While many AND algorithms have been proposed, comparing them is difficult because they often employ distinct features and are evaluated on different datasets. In response to this challenge, we present S2AND, a unified benchmark dataset for AND on scholarly papers, as well as an open-source reference model implementation. Our dataset harmonizes eight disparate AND datasets into a uniform format, with a single rich feature set drawn from the Semantic Scholar (S2) database. Our evaluation suite for S2AND reports performance split by facets like publication year and number of papers, allowing researchers to track both global performance and measures of fairness across facet values. Our experiments show that because previous datasets tend to cover idiosyncratic and biased slices of the literature, algorithms trained to perform well on one of them may generalize poorly to others. By contrast, we show how training on a union of datasets in S2AND results in more robust models that perform well even on datasets unseen in training. The resulting AND model also substantially improves over the production algorithm in S2, reducing error by over 50% in terms of B³ F1. We release our unified dataset, model code, trained models, and evaluation suite to the research community.
AB - Author Name Disambiguation (AND) is the task of resolving which author mentions in a bibliographic database refer to the same real-world person, and is a critical ingredient of digital library applications such as search and citation analysis. While many AND algorithms have been proposed, comparing them is difficult because they often employ distinct features and are evaluated on different datasets. In response to this challenge, we present S2AND, a unified benchmark dataset for AND on scholarly papers, as well as an open-source reference model implementation. Our dataset harmonizes eight disparate AND datasets into a uniform format, with a single rich feature set drawn from the Semantic Scholar (S2) database. Our evaluation suite for S2AND reports performance split by facets like publication year and number of papers, allowing researchers to track both global performance and measures of fairness across facet values. Our experiments show that because previous datasets tend to cover idiosyncratic and biased slices of the literature, algorithms trained to perform well on one of them may generalize poorly to others. By contrast, we show how training on a union of datasets in S2AND results in more robust models that perform well even on datasets unseen in training. The resulting AND model also substantially improves over the production algorithm in S2, reducing error by over 50% in terms of B³ F1. We release our unified dataset, model code, trained models, and evaluation suite to the research community.
KW - Author name disambiguation
KW - Digital libraries
KW - Out-of-domain evaluation
UR - http://www.scopus.com/inward/record.url?scp=85124219155&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85124219155&partnerID=8YFLogxK
U2 - 10.1109/JCDL52503.2021.00029
DO - 10.1109/JCDL52503.2021.00029
M3 - Conference contribution
AN - SCOPUS:85124219155
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 170
EP - 179
BT - Proceedings - 2021 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021
A2 - Downie, J. Stephen
A2 - McKay, Dana
A2 - Suleman, Hussein
A2 - Nichols, David M.
A2 - Poursardar, Faryaneh
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 September 2021 through 30 September 2021
ER -