TY - JOUR
T1 - Analysis of a probabilistic model of redundancy in unsupervised information extraction
AU - Downey, Douglas C
AU - Etzioni, Oren
AU - Soderland, Stephen
N1 - Funding Information:
This research was supported in part by NSF grants IIS-0535284 and IIS-0312988, DARPA contract NBCHD030010, ONR grants N00014-02-1-0324 and N00014-08-1-0431, and gifts from Google, and carried out at the University of Washington’s Turing Center. The first author was supported by a Microsoft Research Graduate Fellowship sponsored by Microsoft Live Labs. Google generously allowed us to issue a large number of queries to their XML API to facilitate our experiments. We thank Pedro Domingos, Anna Karlin, Marina Meila, and Dan Weld for helpful discussions, and Jeff Bigham for comments on previous drafts. Also, thanks to Alex Yates for suggesting we consider this problem.
PY - 2010/7
Y1 - 2010/7
N2 - Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without the use of hand-labeled training examples. Because UIE systems do not require human intervention, they can recursively discover new relations, attributes, and instances in a scalable manner. When applied to massive corpora such as the Web, UIE systems present an approach to a primary challenge in artificial intelligence: The automatic accumulation of massive bodies of knowledge. A fundamental problem for a UIE system is assessing the probability that its extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness? We present a combinatorial "balls-and-urns" model, called Urns, that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating Urns's parameters in practice and demonstrate experimentally that for UIE the model's log likelihoods are 15 times better, on average, than those obtained by methods used in previous work. We illustrate the generality of the redundancy model by detailing multiple applications beyond UIE in which Urns has been effective. We also provide a theoretical foundation for Urns's performance, including a theorem showing that PAC Learnability in Urns is guaranteed without hand-labeled data, under certain assumptions.
AB - Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without the use of hand-labeled training examples. Because UIE systems do not require human intervention, they can recursively discover new relations, attributes, and instances in a scalable manner. When applied to massive corpora such as the Web, UIE systems present an approach to a primary challenge in artificial intelligence: The automatic accumulation of massive bodies of knowledge. A fundamental problem for a UIE system is assessing the probability that its extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness? We present a combinatorial "balls-and-urns" model, called Urns, that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating Urns's parameters in practice and demonstrate experimentally that for UIE the model's log likelihoods are 15 times better, on average, than those obtained by methods used in previous work. We illustrate the generality of the redundancy model by detailing multiple applications beyond UIE in which Urns has been effective. We also provide a theoretical foundation for Urns's performance, including a theorem showing that PAC Learnability in Urns is guaranteed without hand-labeled data, under certain assumptions.
KW - Information extraction
KW - Unsupervised
KW - World Wide Web
UR - http://www.scopus.com/inward/record.url?scp=79955785333&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79955785333&partnerID=8YFLogxK
U2 - 10.1016/j.artint.2010.04.024
DO - 10.1016/j.artint.2010.04.024
M3 - Article
AN - SCOPUS:79955785333
SN - 0004-3702
VL - 174
SP - 726
EP - 748
JO - Artificial Intelligence
JF - Artificial Intelligence
IS - 11
ER -