A scalable parallel subspace clustering algorithm for massive data sets

H. S. Nagesh, S. Goil, A. Choudhary

Research output: Chapter in Book/Report/Conference proceedingConference contribution

44 Scopus citations

Abstract

Clustering is a data mining problem which finds dense regions in a sparse multi-dimensional data set. The attribute values and ranges of these regions characterize the clusters. Clustering algorithms need to scale with the data base size and also with the large dimensionality of the data set. Further, these algorithms need to explore the embedded clusters in a subspace of a high dimensional space. However the time complexity of the algorithm to explore clusters in subspaces is exponential in the dimensionality of the data and is thus extremely compute intensive. Thus, parallelization is the choice for discovering clusters for large data sets. In this paper we present a scalable parallel subspace clustering algorithm which has both data and task parallelism embedded in it. We also formulate the technique of adaptive grids and present a truly unsupervised clustering algorithm requiring no user inputs. Our implementation shows near linear speedups with negligible communication overheads. The use of adaptive grids results in two orders of magnitude improvement in the computation time of our serial algorithm over current methods with much better quality of clustering. Performance results on both real and synthetic data sets with very large number of dimensions on a 16 node IBM SP2 demonstrate our algorithm to be a practical and scalable clustering technique.

Original languageEnglish (US)
Title of host publicationProceedings - 2000 International Conference on Parallel Processing, ICPP 2000
EditorsDavid J. Lilja
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages477-484
Number of pages8
ISBN (Electronic)0769507689
DOIs
StatePublished - 2000
EventInternational Conference on Parallel Processing, ICPP 2000 - Toronto, Canada
Duration: Aug 21 2000Aug 24 2000

Publication series

NameProceedings of the International Conference on Parallel Processing
Volume2000-January
ISSN (Print)0190-3918

Other

OtherInternational Conference on Parallel Processing, ICPP 2000
CountryCanada
CityToronto
Period8/21/008/24/00

Keywords

  • Clustering algorithms
  • Concurrent computing
  • Data mining
  • Grid computing
  • Large-scale systems
  • Machine learning
  • Machine learning algorithms
  • Parallel processing
  • Statistics
  • Sun

ASJC Scopus subject areas

  • Software
  • Mathematics(all)
  • Hardware and Architecture

Fingerprint Dive into the research topics of 'A scalable parallel subspace clustering algorithm for massive data sets'. Together they form a unique fingerprint.

Cite this