A scalable hierarchical clustering algorithm using spark

Chen Jin, Ruoqian Liu, William Hendrix, Ankit Agrawal, Alok Choudhary, Zhengzhang Chen

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

45 Scopus citations

Abstract

Clustering is often an essential first step in data mining intended to reduce redundancy, or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting the potential group structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for the parallelization of single-linkage clustering algorithm due to its natural expression of iterative process. Our algorithm can be deployed easily in Amazon's cloud environment. And a thorough performance evaluation in Amazon's EC2 verifies that the scalability of our algorithm sustains when the datasets scale up.

Original languageEnglish
Title of host publicationProceedings - 2015 IEEE 1st International Conference on Big Data Computing Service and Applications, BigDataService 2015
Pages418-427
Number of pages10
ISBN (Electronic)9781479981281
DOIs
StatePublished - 10 Aug 2015
Event1st IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2015 - San Francisco, United States
Duration: 30 Mar 20153 Apr 2015

Publication series

NameProceedings - 2015 IEEE 1st International Conference on Big Data Computing Service and Applications, BigDataService 2015

Conference

Conference1st IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2015
Country/TerritoryUnited States
CitySan Francisco
Period30/03/153/04/15

Keywords

  • Hierarchical Clustering
  • Minimum Spanning Tree
  • Spark

Fingerprint

Dive into the research topics of 'A scalable hierarchical clustering algorithm using spark'. Together they form a unique fingerprint.

Cite this