A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Md Mostofa Ali Patwary*, Diana Palsetia, Ankit Agrawal, Wei Keng Liao, Fredrik Manne, Alok Choudhary

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

140 Scopus citations

Abstract

DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of Dbscan is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy which can easily cause an unbalanced workload and hence result in low parallel efficiency. We present a new parallel Dbscan algorithm (Pdsdbscan) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of Dbscan. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory. Using data sets containing up to several hundred million high-dimensional points, we show that Pdsdbscan significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.

Original languageEnglish (US)
Title of host publication2012 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2012
DOIs
StatePublished - 2012
Event2012 24th International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2012 - Salt Lake City, UT, United States
Duration: Nov 10 2012Nov 16 2012

Publication series

NameInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print)2167-4329
ISSN (Electronic)2167-4337

Other

Other2012 24th International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2012
Country/TerritoryUnited States
CitySalt Lake City, UT
Period11/10/1211/16/12

Keywords

  • Density based clustering
  • Disjoint-set data structure
  • Union-Find algorithm

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Fingerprint

Dive into the research topics of 'A new scalable parallel DBSCAN algorithm using the disjoint-set data structure'. Together they form a unique fingerprint.

Cite this