Cluchunk: Clustering large scale user-generated content incorporating chunklet information

Yu Cheng*, Yusheng Xie, Kunpeng Zhang, Ankit Agrawal, Alok Choudhary

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

The exponential rise of online content in the form of blogs, microblogs, forums, and multimedia sharing sites has created an urgent demand for efficient, high-quality text clustering algorithms that organize documents well enough for users to navigate and browse them quickly. For several kinds of user-generated content, it is much easier to obtain the input in small sets, where the data in each set belongs to the same class but the class labels are unknown. Such data is viewed as weakly-labeled data, and the inherent chunklet information is very useful for improving clustering performance. In this paper, we propose a system, CluChunk (clustering chunklet data), that clusters unlabeled web data by incorporating chunklet information. We transform the original feature space with a discriminatively learned linear transformation such that simple unsupervised learning techniques (such as K-Means) can achieve good clustering accuracy in the transformed space. Using large-scale data from several web applications (social media and online forums), we demonstrate that clustering performance improves significantly by: 1) incorporating the inherent weakly-labeled information into the clustering framework; 2) enriching the representation of short text with additional features extracted from the chunklet subset. The proposed approach can be applied to other mining tasks with large-scale user-generated content, such as product review summarization and blog content clustering/classification.
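The abstract does not say how the linear transformation is learned, so the following is only a minimal sketch of the general idea, not the CluChunk method itself. It swaps in a simplified, diagonal form of within-chunklet whitening (in the spirit of Relevant Component Analysis): each feature is divided by its standard deviation measured inside chunklets, and plain K-Means is then run in the rescaled space. The function names (`within_chunklet_scale`, `transform`, `kmeans`) and the toy data are hypothetical and chosen for illustration.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def within_chunklet_scale(chunklets, eps=1e-9):
    """Per-feature scale 1/std, with std measured around each chunklet's own mean.

    Assumption: a diagonal stand-in for a learned linear transformation.
    """
    dim = len(chunklets[0][0])
    sq_sum = [0.0] * dim
    count = 0
    for chunk in chunklets:
        center = [mean(p[d] for p in chunk) for d in range(dim)]
        for p in chunk:
            for d in range(dim):
                sq_sum[d] += (p[d] - center[d]) ** 2
            count += 1
    return [1.0 / (sq_sum[d] / count + eps) ** 0.5 for d in range(dim)]

def transform(points, scale):
    """Apply the diagonal linear map to every point."""
    return [[x * s for x, s in zip(p, scale)] for p in points]

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

# Toy data (invented): feature 0 separates the two classes, while feature 1
# is high-variance noise that varies freely inside every chunklet.
chunklets = [
    [[0.0, -50.0], [0.1, 20.0]],    # chunklet from class A
    [[-0.1, 50.0], [0.05, -20.0]],  # chunklet from class A
    [[1.0, -50.0], [0.9, 20.0]],    # chunklet from class B
    [[1.1, 50.0], [1.05, -20.0]],   # chunklet from class B
]
points = [p for chunk in chunklets for p in chunk]

scale = within_chunklet_scale(chunklets)
labels = kmeans(transform(points, scale), k=2)
print(labels)
```

In this toy setting, feature 1 dominates raw Euclidean distances even though it carries no class information. Because it varies widely inside every chunklet, the rescaling shrinks it, which is the intuition behind using chunklets to shape the feature space before clustering.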

Original language: English (US)
Title of host publication: Proceedings of 1st Int. Workshop on Big Data, Streams and Heterogeneous Source Mining
Subtitle of host publication: Algorithms, Systems, Programming Models and Applications, BigMine-12 - Held in Conjunction with SIGKDD Conference
Pages: 12-19
Number of pages: 8
State: Published - 2012
Event: 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine-12 - Held in Conjunction with SIGKDD Conference - Beijing, China
Duration: Aug 12 2012 → Aug 12 2012

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Other

Other: 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine-12 - Held in Conjunction with SIGKDD Conference
Country/Territory: China
City: Beijing
Period: 8/12/12 → 8/12/12

Keywords

  • Chunklet
  • Data transformation
  • Text clustering
  • User-generated content

ASJC Scopus subject areas

  • Software
  • Information Systems
