Learning to group web text incorporating prior information

Yu Cheng*, Kunpeng Zhang, Yusheng Xie, Ankit Agrawal, Wei-Keng Liao, Alok Nidhi Choudhary

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations


Clustering similar items of web text has become increasingly important in many Web and Information Retrieval applications. For several kinds of web text data, it is much easier to obtain external information, beyond textual features, that can be used to improve the performance of clustering analysis. This external information, called prior information, indicates label sign and pairwise constraints on sample points. We propose a unifying framework that can incorporate prior information about cluster membership for web text cluster analysis, and we develop a novel semi-supervised clustering model. The proposed framework offers several advantages over existing semi-supervised approaches. First, most previous work handles labeled data by converting it to pairwise constraints, which leads to much more computation. The proposed approach handles pairwise constraints and labeled data simultaneously, so the computation is greatly reduced. Second, the framework allows us to obtain this prior information automatically or with little human effort, making it possible to boost clustering performance relatively easily. We evaluated the proposed method on the real-world problems of automatically grouping online news feeds and web blog messages. Experimental results indicate that the proposed framework incorporating prior information can indeed lead to statistically significant clustering improvements over approaches with access only to textual features.
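The idea of using labeled points directly alongside pairwise constraints can be illustrated with a small sketch. The code below is not the model proposed in the paper. It is a generic constrained k-means variant, written under assumptions of our own: the function name `semi_supervised_kmeans`, its parameters, the greedy constraint check, and the toy data are all hypothetical. Labeled points keep their given cluster and seed the centroids, while must-link and cannot-link pairs restrict where the remaining points may go.

```python
import math

def semi_supervised_kmeans(points, k, labels=None, must_link=(), cannot_link=(), n_iter=20):
    """Illustrative constrained k-means (hypothetical helper, not the paper's model).

    points: list of equal-length numeric tuples
    labels: dict {point index: cluster id} for points with a known label
    must_link / cannot_link: iterables of (i, j) index pairs
    """
    labels = dict(labels or {})
    n, dim = len(points), len(points[0])

    # Store constraints symmetrically so either endpoint can look up its partners.
    ml = {i: set() for i in range(n)}
    cl = {i: set() for i in range(n)}
    for i, j in must_link:
        ml[i].add(j); ml[j].add(i)
    for i, j in cannot_link:
        cl[i].add(j); cl[j].add(i)

    def mean(idx):
        return tuple(sum(points[i][d] for i in idx) / len(idx) for d in range(dim))

    # Seed each centroid from its labeled points; fall back to an arbitrary point.
    centroids = []
    for c in range(k):
        seeds = [i for i, lab in labels.items() if lab == c]
        centroids.append(mean(seeds) if seeds else tuple(points[c % n]))

    # Visit labeled points first so their clusters constrain the unlabeled ones.
    order = sorted(range(n), key=lambda i: i not in labels)
    assign = {}
    for _ in range(n_iter):
        new_assign = {}
        for i in order:
            if i in labels:
                # Labels are used as-is, not expanded into pairwise constraints.
                new_assign[i] = labels[i]
                continue
            ranked = sorted(range(k), key=lambda c: math.dist(points[i], centroids[c]))

            def feasible(c):
                return (all(new_assign[j] == c for j in ml[i] if j in new_assign)
                        and all(new_assign[j] != c for j in cl[i] if j in new_assign))

            # Nearest cluster that violates no constraint; nearest overall if none is feasible.
            new_assign[i] = next((c for c in ranked if feasible(c)), ranked[0])
        if new_assign == assign:
            break
        assign = new_assign
        for c in range(k):
            members = [i for i in range(n) if assign[i] == c]
            if members:
                centroids[c] = mean(members)
    return [assign[i] for i in range(n)]

# Toy data: two tight groups plus one ambiguous point at index 6.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (4, 4)]
result = semi_supervised_kmeans(pts, k=2, labels={0: 0, 3: 1},
                                must_link=[(1, 0)], cannot_link=[(6, 0)])
print(result)  # [0, 0, 0, 1, 1, 1, 1]
```

In this toy run, point 6 lies closer to the first group, but the cannot-link pair with point 0 pushes it into the second cluster. This is the kind of effect prior information is meant to have when textual features alone are ambiguous.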

Original language: English (US)
Title of host publication: Proceedings - 11th IEEE International Conference on Data Mining Workshops, ICDMW 2011
Number of pages: 8
State: Published - Dec 1 2011
Event: 11th IEEE International Conference on Data Mining Workshops, ICDMW 2011 - Vancouver, BC, Canada
Duration: Dec 11 2011 → Dec 11 2011


Other: 11th IEEE International Conference on Data Mining Workshops, ICDMW 2011
City: Vancouver, BC


  • Pairwise constraints
  • Prior information
  • Semi-supervised clustering
  • Web text

ASJC Scopus subject areas

  • Engineering(all)

