A parametric mixture model for clustering multivariate binary data

Ajit C. Tamhane, Dingxi Qiu*, Bruce E. Ankenman

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

The traditional latent class analysis (LCA) uses a mixture model with binary responses on each subject that are independent conditional on cluster membership. However, in many practical applications, the responses are correlated because they are observed on the same subject; this is known as local dependence. In this paper, we extend the LCA model to allow for local dependence in each cluster to improve clustering accuracy. The clustering problem is hard because of its unsupervised learning nature (the true cluster memberships and even the true number of clusters are unknown), the difficulty of estimating a correlation matrix for each cluster and the paucity of information in binary data. Therefore, we follow a parametric approach in which we fit a mixture model whose components follow multivariate Bernoulli distributions (one for each cluster). An extension of a family of parametric models by Oman and Zucker [1] is adopted for this purpose and the maximum likelihood estimation method is used for fitting. The Bayesian information criterion (BIC) due to Schwarz [2] is employed to select the number of clusters. Subjects are classified to clusters using the maximum posterior rule. The proposed method is tested and compared with the LCA method via simulation and by applying both methods to two real data sets. Significant improvement is demonstrated relative to the LCA method.
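As a point of reference for the pipeline the abstract describes, here is a minimal sketch of the traditional LCA baseline only: a mixture of independent Bernoulli items fitted by EM, scored with BIC, and classified with the maximum posterior rule. It does not implement the locally dependent extension based on the Oman and Zucker family. The function names (`e_step`, `fit_lca`, `bic`, `classify`), the random initialization, the fixed iteration count, and the "smaller is better" form of BIC are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random

def e_step(X, weights, probs):
    """Return posterior class probabilities per subject and the log-likelihood."""
    post, loglik = [], 0.0
    for x in X:
        # Log joint of (class k, responses x) under conditional independence.
        logp = [
            math.log(w) + sum(math.log(p if xj else 1.0 - p) for xj, p in zip(x, pk))
            for w, pk in zip(weights, probs)
        ]
        m = max(logp)  # log-sum-exp for numerical stability
        s = sum(math.exp(v - m) for v in logp)
        loglik += m + math.log(s)
        post.append([math.exp(v - m) / s for v in logp])
    return post, loglik

def fit_lca(X, K, n_iter=200, seed=0):
    """EM for a K-class latent class model on a list of 0/1 response vectors."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    weights = [1.0 / K] * K
    # Random start, kept away from 0 and 1 (an illustrative choice).
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(K)]
    for _ in range(n_iter):
        post, _ = e_step(X, weights, probs)
        # M-step: weighted proportions, clamped so later logs stay finite.
        for k in range(K):
            nk = max(sum(r[k] for r in post), 1e-12)
            weights[k] = max(nk / n, 1e-12)
            for j in range(d):
                pj = sum(r[k] * x[j] for r, x in zip(post, X)) / nk
                probs[k][j] = min(max(pj, 1e-6), 1.0 - 1e-6)
    # Final E-step so the posteriors and log-likelihood match the returned parameters.
    post, loglik = e_step(X, weights, probs)
    return weights, probs, post, loglik

def bic(loglik, K, d, n):
    """BIC in the 'smaller is better' form: -2 log L + (number of parameters) log n."""
    n_params = (K - 1) + K * d  # mixing weights plus item probabilities
    return -2.0 * loglik + n_params * math.log(n)

def classify(post):
    """Maximum posterior rule: assign each subject to its most probable class."""
    return [max(range(len(r)), key=r.__getitem__) for r in post]
```

In use, one would call `fit_lca` for several candidate values of `K`, keep the fit with the smallest `bic`, and pass its posteriors to `classify`. Because EM can stop at a local maximum, running several seeds per `K` is a sensible precaution.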

Original language: English (US)
Pages (from-to): 3-19
Number of pages: 17
Journal: Statistical Analysis and Data Mining
Volume: 3
Issue number: 1
State: Published - Feb 2010

Keywords

  • Bayes classification rule
  • Bayesian information criterion (BIC)
  • Data mining
  • EM algorithm
  • Latent class analysis
  • Maximum likelihood estimation

ASJC Scopus subject areas

  • Analysis
  • Information Systems
  • Computer Science Applications
