Measuring document similarity with weighted averages of word embeddings

Bryan Seegmiller*, Dimitris Papanikolaou, Lawrence D.W. Schmidt

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

We detail a methodology for estimating the textual similarity between two documents while accounting for the possibility that two different words can have a similar meaning. We illustrate the method's usefulness in facilitating comparisons between documents with very different formats and vocabularies by textually linking occupation task and industry output descriptions with related technologies as described in patent texts; we also examine economic applications of the resultant document similarity measures. In a further application, we demonstrate that the method also works well relative to alternatives for comparing documents within the same domain: pairwise textual similarity between occupations’ task descriptions strongly predicts the probability that a given worker will transition from one occupation to another. Finally, we offer some suggestions on other potential uses and guidance in implementing the method.
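
As a rough illustration of the general idea named in the title (not the authors' implementation), the sketch below represents each document as a weighted average of its word vectors and compares two documents by cosine similarity. The toy 3-dimensional embeddings, the term weights (standing in for something like TF-IDF-style weights), and the function names `document_vector` and `cosine_similarity` are all assumptions made for this example; a real application would load pretrained embeddings and compute weights from a corpus.

```python
import numpy as np

# Hypothetical toy embeddings, invented for illustration only.
# Real applications would load pretrained word vectors instead.
EMBEDDINGS = {
    "weld":    np.array([0.9, 0.1, 0.0]),
    "solder":  np.array([0.8, 0.2, 0.1]),
    "metal":   np.array([0.7, 0.3, 0.0]),
    "teach":   np.array([0.0, 0.1, 0.9]),
    "student": np.array([0.1, 0.0, 0.8]),
}

# Hypothetical term weights (assumed here; e.g. TF-IDF-style weights).
WEIGHTS = {"weld": 2.0, "solder": 2.0, "metal": 1.0, "teach": 2.0, "student": 1.5}


def document_vector(tokens, weights, embeddings):
    """Weighted average of the embeddings of all in-vocabulary tokens."""
    dim = len(next(iter(embeddings.values())))
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        if tok in embeddings:
            w = weights.get(tok, 1.0)  # default weight 1.0 if none is given
            total += w * embeddings[tok]
            weight_sum += w
    # Return the zero vector if no token was found in the vocabulary.
    return total / weight_sum if weight_sum > 0 else total


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Two documents that use different but related words, plus an unrelated one.
task_doc = ["weld", "metal"]
patent_doc = ["solder", "metal"]
unrelated_doc = ["teach", "student"]

task_vec = document_vector(task_doc, WEIGHTS, EMBEDDINGS)
sim_related = cosine_similarity(
    task_vec, document_vector(patent_doc, WEIGHTS, EMBEDDINGS)
)
sim_unrelated = cosine_similarity(
    task_vec, document_vector(unrelated_doc, WEIGHTS, EMBEDDINGS)
)
print(sim_related, sim_unrelated)
```

With these made-up vectors, "weld" and "solder" sit close together in the embedding space, so the two documents score as similar even though those words never match exactly, which is the property the abstract emphasizes.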

Original language: English (US)
Article number: 101494
Journal: Explorations in Economic History
Volume: 87
State: Published - Jan 2023

Keywords

  • Document similarity
  • Natural language processing
  • Textual analysis for economists

ASJC Scopus subject areas

  • History
  • Economics and Econometrics
