TY - JOUR
T1 - Unsupervised named-entity extraction from the Web
T2 - An experimental study
AU - Etzioni, Oren
AU - Cafarella, Michael
AU - Downey, Doug
AU - Popescu, Ana Maria
AU - Shaked, Tal
AU - Soderland, Stephen
AU - Weld, Daniel S.
AU - Yates, Alexander
N1 - Funding Information:
KNOWITALL was inspired, in part, by the WebKB project [13]. However, the two projects rely on very different architectures and learning techniques. For example, WebKB relies on supervised learning methods that take as input hand-labeled hypertext regions to classify Web pages, whereas KNOWITALL employs unsupervised learning methods that extract facts by using search engines to home in on easy-to-understand sentences scattered throughout the Web. Finally, KNOWITALL also shares the motivation of Schubert’s project [39], which seeks to derive general world knowledge from texts. However, Schubert and his colleagues have focused on highly-structured texts such as WordNet and the Brown corpus whereas KNOWITALL has focused on the Web.
This research was supported in part by NSF grants IIS-0312988 and IIS-0307906, DARPA contract NBCHD030010, ONR grants N00014-02-1-0324 and N00014-02-1-0932, and a gift from Google. Google generously allowed us to issue a large number of queries to their XML API to facilitate our experiments. We thank Jeff Bigham and Nick Kushmerick for comments on previous drafts, and Bob Doorenbos, Mike Perkowitz, and Ellen Riloff for helpful discussions.
PY - 2005/6
Y1 - 2005/6
N2 - The KnowItAll system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KnowItAll's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KnowItAll extracted over 50,000 class instances, but suggested a challenge: How can we improve KnowItAll's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KnowItAll's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KnowItAll a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
AB - The KnowItAll system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KnowItAll's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KnowItAll extracted over 50,000 class instances, but suggested a challenge: How can we improve KnowItAll's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KnowItAll's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KnowItAll a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
KW - Information Extraction
KW - Pointwise mutual information
KW - Question answering
KW - Unsupervised
UR - http://www.scopus.com/inward/record.url?scp=17644423946&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=17644423946&partnerID=8YFLogxK
U2 - 10.1016/j.artint.2005.03.001
DO - 10.1016/j.artint.2005.03.001
M3 - Article
AN - SCOPUS:17644423946
SN - 0004-3702
VL - 165
SP - 91
EP - 134
JO - Artificial Intelligence
JF - Artificial Intelligence
IS - 1
ER -