TY - GEN
T1 - Semi supervised image spam hunter
T2 - 5th International Conference on Advanced Data Mining and Applications, ADMA 2009
AU - Gao, Yan
AU - Yang, Ming
AU - Choudhary, Alok
N1 - Copyright 2009 Elsevier B.V., All rights reserved.
PY - 2009
Y1 - 2009
N2 - Image spam is a new trend in email spam. New image spam employs a variety of image processing techniques to add random noise. In this paper, we propose a semi-supervised approach, the regularized discriminant EM algorithm (RDEM), to detect image spam emails. It leverages a small amount of labeled data and a large amount of unlabeled data to identify spam and train a classification model simultaneously. Compared with fully supervised learning algorithms, the semi-supervised learning algorithm is better suited to adversarial classification problems, because spammers actively protect their work by constantly making changes to circumvent spam detection. This makes it too costly for fully supervised learning to frequently collect sufficient labeled data for training. Experimental results demonstrate that our approach achieves a detection rate of 91.66% with a false positive rate of less than 2.96%, while significantly reducing the labeling cost.
AB - Image spam is a new trend in email spam. New image spam employs a variety of image processing techniques to add random noise. In this paper, we propose a semi-supervised approach, the regularized discriminant EM algorithm (RDEM), to detect image spam emails. It leverages a small amount of labeled data and a large amount of unlabeled data to identify spam and train a classification model simultaneously. Compared with fully supervised learning algorithms, the semi-supervised learning algorithm is better suited to adversarial classification problems, because spammers actively protect their work by constantly making changes to circumvent spam detection. This makes it too costly for fully supervised learning to frequently collect sufficient labeled data for training. Experimental results demonstrate that our approach achieves a detection rate of 91.66% with a false positive rate of less than 2.96%, while significantly reducing the labeling cost.
UR - http://www.scopus.com/inward/record.url?scp=70350347131&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70350347131&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-03348-3_17
DO - 10.1007/978-3-642-03348-3_17
M3 - Conference contribution
AN - SCOPUS:70350347131
SN - 3642033474
SN - 9783642033476
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 152
EP - 164
BT - Advanced Data Mining and Applications - 5th International Conference, ADMA 2009, Proceedings
Y2 - 17 August 2009 through 19 August 2009
ER -