A novel scalable DBSCAN algorithm with spark

Dianwei Han*, Ankit Agrawal, Wei Keng Liao, Alok Choudhary

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

43 Scopus citations

Abstract

DBSCAN is a well-known clustering algorithmwhich is based on density and is able to identify arbitraryshaped clusters and eliminate noise data. However, parallelization of DBSCAN is a challenging work because based on MPIor OpenMP environments, there exist the issues of lack of fault-tolerance and there is no guarantee that workload is balanced. Moreover, programming with MPI requires data scientists tohave an advanced experience to handle communication betweennodes which is a big challenge. We present a new parallel DBSCAN algorithm using thenew big data framework Spark. In order to reduce searchtime, we apply kd-tree in our algorithm. More specifically, wepropose a novel approach to avoid communication betweenexecutors so that we can locally obtain partial clusters moreefficiently. Based on Java API, we select appropriate datastructures carefully: Using Queue to contain neighbors of thedata point, and using Hashtable when checking the statusof and processing the data points. In addition, we use otheradvanced features from Spark to make our implementationmore effective. We implement the algorithm in Java andevaluate its scalability by using different number of processingcores. Our experiments demonstrate that the algorithm wepropose scales up very well. Using data sets containing up to1 million high-dimensional points, we show that our proposedalgorithm achieves speedups up to 6 using 8 cores (10k), 10using 32 cores (100k), and 137 using 512 cores (1m). Anotherexperiment using 10k data points is conducted and the resultshows that the algorithm with MapReduce achieves speedupsto 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.

Original languageEnglish (US)
Title of host publicationProceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1393-1402
Number of pages10
ISBN (Electronic)9781509021406
DOIs
StatePublished - Jul 18 2016
Event30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 - Chicago, United States
Duration: May 23 2016May 27 2016

Publication series

NameProceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Other

Other30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
Country/TerritoryUnited States
CityChicago
Period5/23/165/27/16

Keywords

  • Bigdata
  • Clustering
  • DBSCAN
  • Spark framework

ASJC Scopus subject areas

  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'A novel scalable DBSCAN algorithm with spark'. Together they form a unique fingerprint.

Cite this