Scaling semi-supervised Naive Bayes with feature marginals

Michael R. Lucas, Douglas C. Downey

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

Semi-supervised learning (SSL) methods augment standard machine learning (ML) techniques to leverage unlabeled data. SSL techniques are often effective in text classification, where labeled data is scarce but large unlabeled corpora are readily available. However, existing SSL techniques typically require multiple passes over the entirety of the unlabeled data, which makes them inapplicable to the massive corpora being produced today. In this paper, we show that improving marginal word frequency estimates using unlabeled data can enable semi-supervised text classification that scales to massive unlabeled data sets. We present a novel learning algorithm that optimizes a Naive Bayes model to accord with statistics calculated from the unlabeled corpus. In experiments with text topic classification and sentiment analysis, we show that our method is both more scalable and more accurate than SSL techniques from previous work.
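As a rough illustration of the idea described in the abstract, the Python sketch below estimates word marginals in a single streaming pass over the unlabeled corpus and then rescales a multinomial Naive Bayes model's per-class word probabilities so that the model-implied marginal P(w) = P(w|+)P(+) + P(w|-)P(-) agrees with the corpus estimate. This is an illustrative assumption, not the paper's exact MNB-FM optimization; the function names and the ratio-preserving update rule are hypothetical.

```python
from collections import Counter

def feature_marginals(unlabeled_docs):
    """One streaming pass over the unlabeled corpus: the fraction of all
    tokens accounted for by each word (the feature marginals)."""
    counts, total = Counter(), 0
    for doc in unlabeled_docs:                 # each doc is a list of tokens
        counts.update(doc)
        total += len(doc)
    return {w: c / total for w, c in counts.items()}

def train_nb_fm(labeled_docs, labels, unlabeled_docs, alpha=1.0):
    """Fit a multinomial Naive Bayes model on the labeled data, then rescale
    the per-class word probabilities so that the model-implied marginal
    P(w) = P(w|+)P(+) + P(w|-)P(-) matches the unlabeled-corpus marginal.
    Illustrative sketch only; the published method solves a per-word
    optimization rather than the heuristic update used here."""
    vocab = {w for d in labeled_docs for w in d}
    vocab |= set().union(*(set(d) for d in unlabeled_docs))

    pos_counts, neg_counts, n_pos = Counter(), Counter(), 0
    for doc, y in zip(labeled_docs, labels):
        (pos_counts if y == 1 else neg_counts).update(doc)
        n_pos += (y == 1)
    prior_pos = n_pos / len(labels)

    def smoothed(counts):
        # Laplace-smoothed class-conditional word probabilities.
        total = sum(counts.values()) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    theta_pos, theta_neg = smoothed(pos_counts), smoothed(neg_counts)

    for w, m_w in feature_marginals(unlabeled_docs).items():
        # Preserve the labeled-data ratio P(w|+)/P(w|-) while forcing the
        # mixture P(w|+)P(+) + P(w|-)P(-) to equal the corpus marginal m_w.
        r = theta_pos[w] / theta_neg[w]
        theta_neg[w] = m_w / (r * prior_pos + (1 - prior_pos))
        theta_pos[w] = r * theta_neg[w]

    # Renormalize each class distribution so it sums to one again.
    z_pos, z_neg = sum(theta_pos.values()), sum(theta_neg.values())
    theta_pos = {w: p / z_pos for w, p in theta_pos.items()}
    theta_neg = {w: p / z_neg for w, p in theta_neg.items()}
    return prior_pos, theta_pos, theta_neg
```

Prediction then follows the standard Naive Bayes rule: score each class by its log prior plus the sum, over tokens in the document, of the log class-conditional word probabilities, and pick the higher-scoring class.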

Original language: English (US)
Title of host publication: Long Papers
Publisher: Association for Computational Linguistics (ACL)
Pages: 343-351
Number of pages: 9
ISBN (Print): 9781937284503
State: Published - Jan 1 2013
Event: 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013 - Sofia, Bulgaria
Duration: Aug 4 2013 - Aug 9 2013

Publication series

Name: ACL 2013 - 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Volume: 1

Other

Other: 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013
Country/Territory: Bulgaria
City: Sofia
Period: 8/4/13 - 8/9/13

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
