Incremental, distributed single-linkage hierarchical clustering algorithm using MapReduce

Chen Jin, Zhengzhang Chen, William Hendrix, Ankit Agrawal, Alok Choudhary

Research output: Contribution to journalConference article

7 Scopus citations

Abstract

Single-linkage hierarchical clustering is one of the prominent and widely-used data mining techniques for its informative representation of clustering results. However, the parallelization of this algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. Moreover, in many modern applications, new data is continuously added into the already huge datasets. It would be impractical to reapply the clustering algorithm on the augmented datasets from scratch. In this paper, we propose a unified algorithm which can not only cluster the large dataset, but also incorporate the newly arrived data incrementally. More specifically, we formulate the single-linkage hierarchical clustering problem as a Minimum Spanning Tree (MST) construction problem on a complete graph. The algorithm decomposes the complete graph into a set of non-overlapped subgraphs, computes the intermediate sub-MSTs for each subgraph in parallel, and merges all the sub-MSTs to achieve the final solution. In addition, the same framework can treat the incremental data insertion as a separate data subset and integrate it nicely with the existing solution. We implement the unified algorithm by employing MapReduce framework. Using both synthetic and real-world datasets containing up to millions of high-dimensional points, we show that the proposed algorithm achieves a scalable speedup up to 200 on 300 computer cores for the base dataset and a speedup up to 120 for the dataset with maximum 5% random insertion. Copyright (c) Society for Modeling & Simulation International (SCS).

Original languageEnglish (US)
Pages (from-to)83-92
Number of pages10
JournalSimulation Series
Volume47
Issue number4
StatePublished - Jan 1 2015
Event23rd High Performance Computing Symposium, HPC 2015, Part of the 2015 Spring Simulation Multi-Conference, SpringSim 2015 - Alexandria, United States
Duration: Apr 12 2015Apr 15 2015

Keywords

  • Hierarchical clustering
  • Incremental hierarchical clustering
  • MapReduce

ASJC Scopus subject areas

  • Computer Networks and Communications

Fingerprint Dive into the research topics of 'Incremental, distributed single-linkage hierarchical clustering algorithm using MapReduce'. Together they form a unique fingerprint.

  • Cite this