Locating complex named entities in web text

Doug Downey, Matthew Broadhead, Oren Etzioni

Research output: Contribution to journal › Conference article › peer-review

102 Scopus citations

Abstract

Named Entity Recognition (NER) is the task of locating and classifying names in text. In previous work, NER was limited to a small number of predefined entity classes (e.g., people, locations, and organizations). However, NER on the Web is a far more challenging problem. Complex names (e.g., film or book titles) can be very difficult to pick out precisely from text. Further, the Web contains a wide variety of entity classes, which are not known in advance. Thus, hand-tagging examples of each entity class is impractical. This paper investigates a novel approach to the first step in Web NER: locating complex named entities in Web text. Our key observation is that named entities can be viewed as a species of multiword units, which can be detected by accumulating n-gram statistics over the Web corpus. We show that this statistical method's F1 score is 50% higher than that of supervised techniques including Conditional Random Fields (CRFs) and Conditional Markov Models (CMMs) when applied to complex names. The method also outperforms CMMs and CRFs by 117% on entity classes absent from the training data. Finally, our method outperforms a semi-supervised CRF by 73%.
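To make the key observation concrete, the following is a minimal sketch of detecting multiword units from n-gram statistics. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which association statistic is used, so the sketch swaps in symmetric conditional probability (a common collocation measure), and the names `ngram_counts`, `association`, `chunk`, the `CORPUS` toy data, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter

# Toy stand-in for a Web-scale corpus (hypothetical data for illustration only).
CORPUS = [
    "we saw Dirty Dancing today".split(),
    "Dirty Dancing is a film".split(),
    "they saw a film today".split(),
    "we like a film".split(),
]


def ngram_counts(sentences):
    """Count unigrams and adjacent-token bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def association(a, b, unigrams, bigrams):
    """Symmetric conditional probability in count form: c(a,b)^2 / (c(a) * c(b)).

    Assumed measure; returns 0.0 when the pair was never observed.
    """
    pair = bigrams[(a, b)]
    if pair == 0:
        return 0.0
    return pair * pair / (unigrams[a] * unigrams[b])


def chunk(tokens, unigrams, bigrams, threshold=0.5):
    """Greedily join adjacent tokens whose association meets the threshold."""
    if not tokens:
        return []
    chunks = []
    current = [tokens[0]]
    for prev, tok in zip(tokens, tokens[1:]):
        if association(prev, tok, unigrams, bigrams) >= threshold:
            current.append(tok)  # strongly associated: extend the current unit
        else:
            chunks.append(" ".join(current))  # weak link: close the unit
            current = [tok]
    chunks.append(" ".join(current))
    return chunks


unigrams, bigrams = ngram_counts(CORPUS)
print(chunk(CORPUS[0], unigrams, bigrams))
```

In this toy corpus, "Dirty" and "Dancing" only ever occur together, so their score is high and they are joined into one unit, while pairs such as "saw Dirty" co-occur only incidentally and stay separate. No hand-tagged examples are involved, which is the property the abstract emphasizes for entity classes not known in advance.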

Original language: English (US)
Pages (from-to): 2733-2739
Number of pages: 7
Journal: IJCAI International Joint Conference on Artificial Intelligence
State: Published - Dec 1 2007
Event: 20th International Joint Conference on Artificial Intelligence, IJCAI 2007 - Hyderabad, India
Duration: Jan 6 2007 to Jan 12 2007

ASJC Scopus subject areas

  • Artificial Intelligence
