TY - GEN
T1 - A novel scalable DBSCAN algorithm with spark
AU - Han, Dianwei
AU - Agrawal, Ankit
AU - Liao, Wei Keng
AU - Choudhary, Alok
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/7/18
Y1 - 2016/7/18
N2 - DBSCAN is a well-known clustering algorithmwhich is based on density and is able to identify arbitraryshaped clusters and eliminate noise data. However, parallelization of DBSCAN is a challenging work because based on MPIor OpenMP environments, there exist the issues of lack of fault-tolerance and there is no guarantee that workload is balanced. Moreover, programming with MPI requires data scientists tohave an advanced experience to handle communication betweennodes which is a big challenge. We present a new parallel DBSCAN algorithm using thenew big data framework Spark. In order to reduce searchtime, we apply kd-tree in our algorithm. More specifically, wepropose a novel approach to avoid communication betweenexecutors so that we can locally obtain partial clusters moreefficiently. Based on Java API, we select appropriate datastructures carefully: Using Queue to contain neighbors of thedata point, and using Hashtable when checking the statusof and processing the data points. In addition, we use otheradvanced features from Spark to make our implementationmore effective. We implement the algorithm in Java andevaluate its scalability by using different number of processingcores. Our experiments demonstrate that the algorithm wepropose scales up very well. Using data sets containing up to1 million high-dimensional points, we show that our proposedalgorithm achieves speedups up to 6 using 8 cores (10k), 10using 32 cores (100k), and 137 using 512 cores (1m). Anotherexperiment using 10k data points is conducted and the resultshows that the algorithm with MapReduce achieves speedupsto 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.
AB - DBSCAN is a well-known clustering algorithmwhich is based on density and is able to identify arbitraryshaped clusters and eliminate noise data. However, parallelization of DBSCAN is a challenging work because based on MPIor OpenMP environments, there exist the issues of lack of fault-tolerance and there is no guarantee that workload is balanced. Moreover, programming with MPI requires data scientists tohave an advanced experience to handle communication betweennodes which is a big challenge. We present a new parallel DBSCAN algorithm using thenew big data framework Spark. In order to reduce searchtime, we apply kd-tree in our algorithm. More specifically, wepropose a novel approach to avoid communication betweenexecutors so that we can locally obtain partial clusters moreefficiently. Based on Java API, we select appropriate datastructures carefully: Using Queue to contain neighbors of thedata point, and using Hashtable when checking the statusof and processing the data points. In addition, we use otheradvanced features from Spark to make our implementationmore effective. We implement the algorithm in Java andevaluate its scalability by using different number of processingcores. Our experiments demonstrate that the algorithm wepropose scales up very well. Using data sets containing up to1 million high-dimensional points, we show that our proposedalgorithm achieves speedups up to 6 using 8 cores (10k), 10using 32 cores (100k), and 137 using 512 cores (1m). Anotherexperiment using 10k data points is conducted and the resultshows that the algorithm with MapReduce achieves speedupsto 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.
KW - Bigdata
KW - Clustering
KW - DBSCAN
KW - Spark framework
UR - http://www.scopus.com/inward/record.url?scp=84991696447&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84991696447&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2016.57
DO - 10.1109/IPDPSW.2016.57
M3 - Conference contribution
AN - SCOPUS:84991696447
T3 - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
SP - 1393
EP - 1402
BT - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
Y2 - 23 May 2016 through 27 May 2016
ER -