Sparse information extraction: Unsupervised language models to the rescue

Doug Downey*, Stefan Schoenmackers, Oren Etzioni

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

31 Scopus citations

Abstract

Even in a massive corpus such as the Web, a substantial fraction of extractions appear infrequently. This paper shows how to assess the correctness of sparse extractions by utilizing unsupervised language models. The REALM system, which combines HMM-based and n-gram-based language models, ranks candidate extractions by the likelihood that they are correct. Our experiments show that REALM reduces extraction error by 39%, on average, when compared with previous work. Because REALM pre-computes language models based on its corpus and does not require any hand-tagged seeds, it is far more scalable than approaches that learn models for each individual relation from hand-tagged data. Thus, REALM is ideally suited for open information extraction where the relations of interest are not specified in advance and their number is potentially vast.
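The abstract does not spell out REALM's models, but the core ranking idea can be illustrated with a deliberately simplified sketch: pre-compute a language model from an unlabeled corpus, then score each candidate extraction by its likelihood under that model, so that sparse but well-formed extractions outrank malformed ones. Everything below (the bigram model, the toy corpus, the function names) is a hypothetical illustration under those assumptions, not the paper's actual HMM- and n-gram-based implementation.

```python
from collections import Counter
from math import log

def train_bigram_model(sentences):
    """Count unigrams and bigrams over a tokenized, unlabeled corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def avg_log_prob(sentence, unigrams, bigrams, vocab_size):
    """Average add-one-smoothed bigram log-probability of a token sequence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total / (len(tokens) - 1)

# Toy unlabeled corpus; a real system would pre-compute over Web-scale text.
corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Rome is the capital of Italy",
]
unigrams, bigrams = train_bigram_model(corpus)
vocab = len(unigrams)

# Candidate extractions for a capital-of relation, phrased as sentences.
candidates = [
    "Paris is the capital of France",   # frequent in the corpus, correct
    "Madrid is the capital of Spain",   # sparse, but fits the relation's pattern
    "France is the capital of Paris",   # arguments in the wrong slots
]

# Rank candidates: higher average log-probability = more plausible extraction.
for cand in sorted(candidates,
                   key=lambda c: avg_log_prob(c, unigrams, bigrams, vocab),
                   reverse=True):
    print(f"{avg_log_prob(cand, unigrams, bigrams, vocab):7.3f}  {cand}")
```

On this toy corpus the sparse-but-plausible candidate still outranks the one with swapped arguments, which mirrors the abstract's point: no hand-tagged seeds or per-relation training are needed, only a language model built once from the corpus.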

Original language: English (US)
Title of host publication: ACL 2007 - Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics
Pages: 696-703
Number of pages: 8
State: Published - Dec 1 2007
Event: 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007 - Prague, Czech Republic
Duration: Jun 23 2007 - Jun 30 2007

Other

Other: 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007
Country/Territory: Czech Republic
City: Prague
Period: 6/23/07 - 6/30/07

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
