TY - GEN
T1 - A scalable hierarchical clustering algorithm using spark
AU - Jin, Chen
AU - Liu, Ruoqian
AU - Hendrix, William
AU - Agrawal, Ankit
AU - Choudhary, Alok
AU - Chen, Zhengzhang
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/8/10
Y1 - 2015/8/10
N2 - Clustering is often an essential first step in data mining intended to reduce redundancy, or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting the potential group structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for the parallelization of single-linkage clustering algorithm due to its natural expression of iterative process. Our algorithm can be deployed easily in Amazon's cloud environment. And a thorough performance evaluation in Amazon's EC2 verifies that the scalability of our algorithm sustains when the datasets scale up.
AB - Clustering is often an essential first step in data mining intended to reduce redundancy, or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting the potential group structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for the parallelization of single-linkage clustering algorithm due to its natural expression of iterative process. Our algorithm can be deployed easily in Amazon's cloud environment. And a thorough performance evaluation in Amazon's EC2 verifies that the scalability of our algorithm sustains when the datasets scale up.
KW - Hierarchical Clustering
KW - Minimum Spanning Tree
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=84959533782&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959533782&partnerID=8YFLogxK
U2 - 10.1109/BigDataService.2015.67
DO - 10.1109/BigDataService.2015.67
M3 - Conference contribution
AN - SCOPUS:84959533782
T3 - Proceedings - 2015 IEEE 1st International Conference on Big Data Computing Service and Applications, BigDataService 2015
SP - 418
EP - 427
BT - Proceedings - 2015 IEEE 1st International Conference on Big Data Computing Service and Applications, BigDataService 2015
T2 - 1st IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2015
Y2 - 30 March 2015 through 3 April 2015
ER -