TY - JOUR

T1 - Clustering semi-random mixtures of Gaussians

AU - Awasthi, Pranjal

AU - Vijayaraghavan, Aravindan

N1 - Publisher Copyright:
Copyright © 2017, The Authors. All rights reserved.
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.

PY - 2017/11/23

Y1 - 2017/11/23

N2 - Gaussian mixture models (GMM) are the most widely used statistical model for the k-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural semi-random model for k-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. In our model, a semi-random adversary is allowed to make arbitrary “monotone” or helpful changes to the data generated from the Gaussian mixture model. Our first contribution is a polynomial time algorithm that provably recovers the ground-truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd's algorithm for k-means clustering that is the method-of-choice in practice. Our second result complements the upper bound by giving a nearly matching information-theoretic lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model.

AB - Gaussian mixture models (GMM) are the most widely used statistical model for the k-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural semi-random model for k-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. In our model, a semi-random adversary is allowed to make arbitrary “monotone” or helpful changes to the data generated from the Gaussian mixture model. Our first contribution is a polynomial time algorithm that provably recovers the ground-truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd's algorithm for k-means clustering that is the method-of-choice in practice. Our second result complements the upper bound by giving a nearly matching information-theoretic lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model.

UR - http://www.scopus.com/inward/record.url?scp=85092813678&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85092813678&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:85092813678

JO - Free Radical Biology and Medicine

JF - Free Radical Biology and Medicine

SN - 0891-5849

ER -