Semi-supervised text classification

Partial training from unlabeled data

Yingtao Bi*, Daniel R. Jeske, Regina Y. Liu

*Corresponding author for this work

Research output: Contribution to conference › Paper

1 Citation (Scopus)

Abstract

We illustrate, through a case study, how a semi-supervised approach can improve the performance of text classification. We begin with a naïve Bayes classifier trained exclusively on labeled text documents and apply it to a set of unlabeled text documents to derive pseudo-labels for them. The pseudo-labels are then combined with the true labels in the original training sample, and a naïve Bayes classifier is built from the enlarged training sample. We consider different proportions of pseudo-labels in the enlarged training sample and examine the effect of the semi-supervised approach on the misclassification rate using cross-validation comparisons.
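
The procedure described above amounts to a pseudo-labeling (self-training) scheme built around a naïve Bayes text classifier. A minimal sketch of that workflow, assuming a bag-of-words representation and scikit-learn's MultinomialNB, is given below; the toy documents, labels, and pseudo-label proportion are illustrative placeholders, not the paper's data or settings.

# Sketch of the pseudo-labeling scheme described in the abstract, using
# scikit-learn's MultinomialNB as the naive Bayes text classifier. The toy
# documents, labels, and pseudo-label proportion are illustrative
# placeholders, not the paper's data or settings.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

labeled_docs = ["ship the order today", "great product, works well",
                "refund the broken item", "package arrived damaged",
                "invoice received, thanks", "poor quality, very unhappy"]
y = np.array([0, 0, 1, 1, 0, 1])                  # toy class labels
unlabeled_docs = ["order shipped on time", "item broken on arrival",
                  "thanks for the quick delivery", "quality is disappointing"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(labeled_docs + unlabeled_docs).toarray()
X_lab, X_unlab = X[:len(labeled_docs)], X[len(labeled_docs):]

proportion = 1.0   # fraction of pseudo-labeled documents added to training
n_keep = int(proportion * len(unlabeled_docs))

errors_supervised, errors_semi = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_lab, y):
    # Supervised baseline: naive Bayes trained on labeled documents only.
    base = MultinomialNB().fit(X_lab[train_idx], y[train_idx])
    errors_supervised.append(np.mean(base.predict(X_lab[test_idx]) != y[test_idx]))

    # Semi-supervised step: pseudo-label the unlabeled pool, enlarge the
    # training sample, and refit the naive Bayes classifier.
    pseudo = base.predict(X_unlab[:n_keep])
    X_big = np.vstack([X_lab[train_idx], X_unlab[:n_keep]])
    y_big = np.concatenate([y[train_idx], pseudo])
    semi = MultinomialNB().fit(X_big, y_big)
    errors_semi.append(np.mean(semi.predict(X_lab[test_idx]) != y[test_idx]))

print("supervised CV error:", np.mean(errors_supervised))
print("semi-supervised CV error:", np.mean(errors_semi))

Varying proportion and rerunning the loop mirrors the comparison of pseudo-label proportions that the abstract describes; with a corpus this small the numbers are only illustrative.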

Original language: English (US)
State: Published - Dec 1 2006
Event: 2006 IIE Annual Conference and Exposition - Orlando, FL, United States
Duration: May 20 2006 - May 24 2006

Other

Other: 2006 IIE Annual Conference and Exposition
Country: United States
City: Orlando, FL
Period: 5/20/06 - 5/24/06

Keywords

  • Cross validation
  • Naïve Bayes classification
  • Semi-supervised learning
  • Text classification
  • Text mining

ASJC Scopus subject areas

  • Industrial and Manufacturing Engineering

Cite this

Bi, Y., Jeske, D. R., & Liu, R. Y. (2006). Semi-supervised text classification: Partial training from unlabeled data. Paper presented at 2006 IIE Annual Conference and Exposition, Orlando, FL, United States.