Abstract
Single-linkage hierarchical clustering is one of the prominent and widely-used data mining techniques for its informative representation of clustering results. However, the parallelization of this algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. Moreover, in many modern applications, new data is continuously added into the already huge datasets. It would be impractical to reapply the clustering algorithm on the augmented datasets from scratch. In this paper, we propose a unified algorithm which can not only cluster the large dataset, but also incorporate the newly arrived data incrementally. More specifically, we formulate the single-linkage hierarchical clustering problem as a Minimum Spanning Tree (MST) construction problem on a complete graph. The algorithm decomposes the complete graph into a set of non-overlapped subgraphs, computes the intermediate sub-MSTs for each subgraph in parallel, and merges all the sub-MSTs to achieve the final solution. In addition, the same framework can treat the incremental data insertion as a separate data subset and integrate it nicely with the existing solution. We implement the unified algorithm by employing MapReduce framework. Using both synthetic and real-world datasets containing up to millions of high-dimensional points, we show that the proposed algorithm achieves a scalable speedup up to 200 on 300 computer cores for the base dataset and a speedup up to 120 for the dataset with maximum 5% random insertion. Copyright (c) Society for Modeling & Simulation International (SCS).
Original language | English |
---|---|
Pages (from-to) | 83-92 |
Number of pages | 10 |
Journal | Simulation Series |
Volume | 47 |
Issue number | 4 |
State | Published - 2015 |
Event | 23rd High Performance Computing Symposium, HPC 2015, Part of the 2015 Spring Simulation Multi-Conference, SpringSim 2015 - Alexandria, United States Duration: 12 Apr 2015 → 15 Apr 2015 |
Keywords
- Hierarchical clustering
- Incremental hierarchical clustering
- MapReduce