Methods for domain-independent information extraction from the web: An experimental comparison

Oren Etzioni*, Michael Cafarella, Doug Downey, Ana Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

81 Scopus citations

Abstract

Our KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an autonomous, domain-independent, and scalable manner. In its first major run, KNOWITALL extracted over 50,000 facts with high precision, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Rule Learning learns domain-specific extraction rules. Subclass Extraction automatically identifies sub-classes in order to boost recall. List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, no hand-labeled training examples are required. Experiments show the relative coverage of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 19-fold increase in recall, while maintaining high precision, and discovered 10,300 cities missing from the Tipster Gazetteer.

Original language: English (US)
Title of host publication: Proceedings - Nineteenth National Conference on Artificial Intelligence (AAAI-04)
Subtitle of host publication: Sixteenth Innovative Applications of Artificial Intelligence Conference (IAAI-2004)
Pages: 391-398
Number of pages: 8
State: Published - Dec 9 2004
Event: Proceedings - Nineteenth National Conference on Artificial Intelligence (AAAI-2004): Sixteenth Innovative Applications of Artificial Intelligence Conference (IAAI-2004) - San Jose, CA, United States
Duration: Jul 25 2004 → Jul 29 2004

Other

Other: Proceedings - Nineteenth National Conference on Artificial Intelligence (AAAI-2004): Sixteenth Innovative Applications of Artificial Intelligence Conference (IAAI-2004)
Country: United States
City: San Jose, CA
Period: 7/25/04 → 7/29/04

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
