TY - GEN
T1 - CluChunk
T2 - 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine-12 - Held in Conjunction with SIGKDD Conference
AU - Cheng, Yu
AU - Xie, Yusheng
AU - Zhang, Kunpeng
AU - Agrawal, Ankit
AU - Choudhary, Alok
PY - 2012
Y1 - 2012
N2 - The exponential rise of online content in the form of blogs, microblogs, forums, and multimedia sharing sites has raised an urgent demand for efficient and high-quality text clustering algorithms for fast navigation and browsing of users based on better document organization. For several kinds of these user-generated content, it is much easier to obtain the input in small sets, where the data in each set belongs to the same class but with unknown class labels. Such data is viewed as weakly-labeled data and the inherent chunklet information is very useful for improving clustering performance. In this paper, we propose a system - CluChunk (clustering chunklet data) to cluster unlabeled web data which incorporates chunklet information. We try to transfer the original feature space by a discriminatively learning linear transformation such that simple unsupervised learning techniques (such as K-Means) in the transformed space can achieve good clustering accuracy. Using larger scale data from some web applications (social media and online forums), we demonstrate that the clustering performance can get significantly improved by: 1) incorporating the inherent weakly-labeled information into the clustering framework; 2) enriching the representation of short text with additional features extracted from the chunklet subset. The proposed approach can be applied to other mining tasks with large scale user-generated content, like product review summarizing and blog content clustering/classification task.
AB - The exponential rise of online content in the form of blogs, microblogs, forums, and multimedia sharing sites has raised an urgent demand for efficient and high-quality text clustering algorithms for fast navigation and browsing of users based on better document organization. For several kinds of these user-generated content, it is much easier to obtain the input in small sets, where the data in each set belongs to the same class but with unknown class labels. Such data is viewed as weakly-labeled data and the inherent chunklet information is very useful for improving clustering performance. In this paper, we propose a system - CluChunk (clustering chunklet data) to cluster unlabeled web data which incorporates chunklet information. We try to transfer the original feature space by a discriminatively learning linear transformation such that simple unsupervised learning techniques (such as K-Means) in the transformed space can achieve good clustering accuracy. Using larger scale data from some web applications (social media and online forums), we demonstrate that the clustering performance can get significantly improved by: 1) incorporating the inherent weakly-labeled information into the clustering framework; 2) enriching the representation of short text with additional features extracted from the chunklet subset. The proposed approach can be applied to other mining tasks with large scale user-generated content, like product review summarizing and blog content clustering/classification task.
KW - Chunklet
KW - Data transformation
KW - Text clustering
KW - User-generated content
UR - http://www.scopus.com/inward/record.url?scp=84866614612&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866614612&partnerID=8YFLogxK
U2 - 10.1145/2351316.2351319
DO - 10.1145/2351316.2351319
M3 - Conference contribution
AN - SCOPUS:84866614612
SN - 9781450315470
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 12
EP - 19
BT - Proceedings of 1st Int. Workshop on Big Data, Streams and Heterogeneous Source Mining
Y2 - 12 August 2012 through 12 August 2012
ER -