Semantic Document Distance Measures and Unsupervised Document Revision Detection

Xiaofeng Zhu, Diego Klabjan, Patrick N. Bless

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

In this paper, we model the document revision detection problem as a minimum cost branching problem that relies on computing document distances. Furthermore, we propose two new document distance measures: word vector-based Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED). Our revision detection system is designed for large-scale corpora and implemented in Apache Spark. Using the Wikipedia revision dumps and simulated data sets, we demonstrate that our system detects revisions more precisely than state-of-the-art methods.
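To make the idea behind a DTW-style document distance concrete, here is a minimal sketch of classic Dynamic Time Warping applied to two sequences of vectors. It is not the authors' wDTW: the abstract does not specify how document units are embedded or compared, so the function names (`euclidean`, `dtw_distance`), the use of Euclidean distance as the local cost, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed local cost)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Classic DTW cost between two sequences of vectors.

    Each element of seq_a / seq_b stands in for one document unit
    (hypothetically, a paragraph embedding). The result is the minimum
    cumulative local distance over all monotone alignments.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of seq_a[:i] against seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_a advances, seq_b unit is reused
                cost[i][j - 1],      # seq_b advances, seq_a unit is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy "documents": the second drops a duplicated first unit.
doc_v1 = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
doc_v2 = [[0.0, 0.0], [1.0, 0.0]]
print(dtw_distance(doc_v1, doc_v2))
```

Because DTW lets one unit align with several units in the other sequence, repeated or split units cost little, which is one reason an alignment-based distance suits comparing successive revisions. Pairwise distances of this kind could then serve as edge weights in the minimum cost branching formulation the abstract describes.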
Original language: English (US)
Title of host publication: Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017)
Pages: 947-956
Number of pages: 10
Volume: 1
State: Published - 2017
