TY - JOUR
T1 - Measuring document similarity with weighted averages of word embeddings
AU - Seegmiller, Bryan
AU - Papanikolaou, Dimitris
AU - Schmidt, Lawrence D.W.
N1 - Publisher Copyright: © 2022 Elsevier Inc.
PY - 2023/1
Y1 - 2023/1
N2 - We detail a methodology for estimating the textual similarity between two documents while accounting for the possibility that two different words can have a similar meaning. We illustrate the method's usefulness in facilitating comparisons between documents with very different formats and vocabularies by textually linking occupation task and industry output descriptions with related technologies as described in patent texts; we also examine economic applications of the resultant document similarity measures. In a final application we demonstrate that the method also works well relative to alternatives for comparing documents within the same domain by showing that pairwise textual similarity between occupations’ task descriptions strongly predicts the probability that a given worker will transition from one occupation to another. Finally, we offer some suggestions on other potential uses and guidance in implementing the method.
KW - Document similarity
KW - Natural language processing
KW - Textual analysis for economists
UR - http://www.scopus.com/inward/record.url?scp=85144385263&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85144385263&partnerID=8YFLogxK
U2 - 10.1016/j.eeh.2022.101494
DO - 10.1016/j.eeh.2022.101494
M3 - Article
AN - SCOPUS:85144385263
SN - 0014-4983
VL - 87
JO - Explorations in Economic History
JF - Explorations in Economic History
M1 - 101494
ER -