## Abstract

This paper presents a comparative study of the K-means algorithm and the mixture model (MM) method for clustering normal data. The EM algorithm is used to compute the maximum likelihood estimators (MLEs) of the parameters of the MM. These parameters include the mixing proportions, which may be thought of as the prior probabilities of the different clusters; the maximum posterior (Bayes) rule is used for clustering. Hence, asymptotically the MM method approaches the Bayes rule for known parameters, which is optimal in terms of minimizing the expected misclassification rate (EMCR). A thorough analytic comparison of the two methods is given for the univariate case under both homoscedasticity and heteroscedasticity, and simulation results are reported for a range of sample sizes. The comparison, which is limited to two clusters, shows that the MM method has substantially lower EMCR, particularly when the mixing proportions are unbalanced. The two methods have asymptotically the same EMCR under homoscedasticity (resp., heteroscedasticity) when the mixing proportions of the two clusters are equal (resp., unequal), but for small samples the MM method sometimes performs slightly worse because of the errors in estimating the unknown parameters.
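The contrast described in the abstract can be illustrated numerically. The sketch below (not code from the paper) simulates a univariate two-cluster normal sample with unbalanced mixing proportions, clusters it once with Lloyd's K-means and once with an EM-fitted two-component normal mixture using the maximum-posterior rule, and compares misclassification rates. The mixing proportion 0.2, means 0 and 3, and unit standard deviations are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a univariate two-cluster normal sample with unbalanced
# mixing proportions (0.8 / 0.2) -- illustrative values only.
n = 2000
labels = (rng.random(n) < 0.2).astype(int)   # 1 = smaller cluster
x = np.where(labels == 1, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

def kmeans_1d(x, iters=50):
    """Lloyd's K-means with two clusters in one dimension."""
    c = np.array([x.min(), x.max()], dtype=float)
    for _ in range(iters):
        assign = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                c[k] = x[assign == k].mean()
    return assign

def em_gmm_1d(x, iters=200):
    """EM for a two-component univariate normal mixture; returns the
    maximum-posterior (Bayes-rule) cluster assignment."""
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()], dtype=float)
    sd = np.array([x.std(), x.std()])
    for _ in range(iters):
        # E-step: posterior probability of each component for each point
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means, standard deviations
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return resp.argmax(axis=1)

def mcr(assign, labels):
    """Misclassification rate, minimized over label switching."""
    err = np.mean(assign != labels)
    return min(err, 1.0 - err)

print("K-means MCR:", mcr(kmeans_1d(x), labels))
print("MM (EM) MCR:", mcr(em_gmm_1d(x), labels))
```

Because K-means places the decision boundary midway between the two centroids while the MM's Bayes rule shifts it toward the smaller cluster, the MM typically attains a lower misclassification rate in this unbalanced setting, consistent with the paper's conclusion.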

| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 3722-3740 |
| Number of pages | 19 |
| Journal | Journal of Statistical Planning and Inference |
| Volume | 137 |
| Issue number | 11 |
| DOIs | |
| State | Published - Nov 1 2007 |

## Keywords

- Bayes rule
- Clustering
- Data mining
- EM algorithm
- K-means algorithm
- Misclassification rate
- Mixture model
- Prior and posterior probabilities

## ASJC Scopus subject areas

- Statistics and Probability
- Statistics, Probability and Uncertainty
- Applied Mathematics