A comparative study of the K-means algorithm and the normal mixture model for clustering: Univariate case

Dingxi Qiu, Ajit C. Tamhane*

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

28 Scopus citations

Abstract

This paper gives a comparative study of the K-means algorithm and the mixture model (MM) method for clustering normal data. The EM algorithm is used to compute the maximum likelihood estimators (MLEs) of the parameters of the MM model. These parameters include mixing proportions, which may be thought of as the prior probabilities of different clusters; the maximum posterior (Bayes) rule is used for clustering. Hence, asymptotically the MM method approaches the Bayes rule for known parameters, which is optimal in terms of minimizing the expected misclassification rate (EMCR). The paper gives a thorough analytic comparison of the two methods for the univariate case under both homoscedasticity and heteroscedasticity. Simulation results are given to compare the two methods for a range of sample sizes. The comparison, which is limited to two clusters, shows that the MM method has substantially lower EMCR particularly when the mixing proportions are unbalanced. The two methods have asymptotically the same EMCR under homoscedasticity (resp., heteroscedasticity) when the mixing proportions of the two clusters are equal (resp., unequal), but for small samples the MM method sometimes performs slightly worse because of the errors in estimating unknown parameters.

Original languageEnglish (US)
Pages (from-to)3722-3740
Number of pages19
JournalJournal of Statistical Planning and Inference
Volume137
Issue number11
DOIs
StatePublished - Nov 1 2007

Keywords

  • Bayes rule
  • Clustering
  • Data mining
  • EM algorithm
  • K-means algorithm
  • Misclassification rate
  • Mixture model
  • Prior and posterior probabilities

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty
  • Applied Mathematics

Fingerprint

Dive into the research topics of 'A comparative study of the K-means algorithm and the normal mixture model for clustering: Univariate case'. Together they form a unique fingerprint.

Cite this