A Fast DBSCAN Algorithm with Spark Implementation

Dianwei Han*, Ankit Agrawal, Wei keng Liao, Alok Choudhary

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingChapter

5 Scopus citations

Abstract

DBSCAN is a well-known clustering algorithm which is based on density and is able to identify arbitrary shaped clusters and eliminate noise data. Parallelization of DBSCAN is a challenging work because there is an inherent sequential data access order and based on MPI or OpenMP environments, there exist the issues of lack of fault-tolerance and there is no guarantee that workload is balanced. Moreover, programming with MPI requires data scientists to handle communication between nodes which is a big challenge. We present a new parallel DBSCAN algorithm using Spark. kd-tree technique is applied in our algorithm to reduce search time. More specifically, a novel merge approach is used so that no communication between executors is required while partial clusters are generated. Appropriate and efficient data structures are carefully used in our study: Using Queue to contain neighbors of the data point, and using Hashtable when checking the status of and processing the data points. Also other advanced data structures from Spark are applied to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability by using different number of processing cores. Our experiments demonstrate that the algorithm we propose scales up very well. Using data sets containing up to 1 million high-dimensional points, we show that our proposed algorithm achieves speedups up to 6 using 8 cores (10 k), 10 using 32 cores (100 k), and 137 using 512 cores (1 m). Another experiment using 10 k data points is conducted and the result shows that the algorithm with MapReduce achieves speedups to 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.

Original languageEnglish (US)
Title of host publicationStudies in Big Data
PublisherSpringer Science and Business Media Deutschland GmbH
Pages173-192
Number of pages20
DOIs
StatePublished - 2018

Publication series

NameStudies in Big Data
Volume44
ISSN (Print)2197-6503
ISSN (Electronic)2197-6511

Keywords

  • Big data
  • DBSCAN
  • Scalable data mining
  • Spark framework

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Engineering (miscellaneous)
  • Computer Science Applications
  • Artificial Intelligence

Fingerprint

Dive into the research topics of 'A Fast DBSCAN Algorithm with Spark Implementation'. Together they form a unique fingerprint.

Cite this