TY - GEN
T1 - hZCCL
T2 - 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024
AU - Huang, Jiajun
AU - Di, Sheng
AU - Yu, Xiaodong
AU - Zhai, Yujia
AU - Liu, Jinyang
AU - Jian, Zizhe
AU - Liang, Xin
AU - Zhao, Kai
AU - Lu, Xiaoyi
AU - Chen, Zizhong
AU - Cappello, Franck
AU - Guo, Yanfei
AU - Thakur, Rajeev
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - As network bandwidth struggles to keep up with rapidly growing computing capabilities, the efficiency of collective communication has become a critical challenge for exa-scale distributed and parallel applications. Traditional approaches directly utilize error-bounded lossy compression to accelerate collective computation operations, exposing unsatisfying performance due to the expensive decompression-operation-compression (DOC) workflow. To address this issue, we present a first-ever homomorphic compression-communication co-design, hZCCL, which enables operations to be performed directly on compressed data, saving the cost of time-consuming decompression and recompression. In addition to the co-design framework, we build a light-weight compressor, optimized specifically for multi-core CPU platforms. We also present a homomorphic compressor with a run-time heuristic to dynamically select efficient compression pipelines for reducing the cost of DOC handling. We evaluate hZCCL with up to 512 nodes and across five application datasets. The experimental results demonstrate that our homomorphic compressor achieves a CPU throughput of up to 379.08 GB/s, surpassing the conventional DOC workflow by up to 36.53×. Moreover, our hZCCL-accelerated collectives outperform two state-of-the-art baselines, delivering speedups of up to 2.12× and 6.77× compared to original MPI collectives in single-thread and multi-thread modes, respectively, while maintaining data accuracy.
AB - As network bandwidth struggles to keep up with rapidly growing computing capabilities, the efficiency of collective communication has become a critical challenge for exa-scale distributed and parallel applications. Traditional approaches directly utilize error-bounded lossy compression to accelerate collective computation operations, exposing unsatisfying performance due to the expensive decompression-operation-compression (DOC) workflow. To address this issue, we present a first-ever homomorphic compression-communication co-design, hZCCL, which enables operations to be performed directly on compressed data, saving the cost of time-consuming decompression and recompression. In addition to the co-design framework, we build a light-weight compressor, optimized specifically for multi-core CPU platforms. We also present a homomorphic compressor with a run-time heuristic to dynamically select efficient compression pipelines for reducing the cost of DOC handling. We evaluate hZCCL with up to 512 nodes and across five application datasets. The experimental results demonstrate that our homomorphic compressor achieves a CPU throughput of up to 379.08 GB/s, surpassing the conventional DOC workflow by up to 36.53×. Moreover, our hZCCL-accelerated collectives outperform two state-of-the-art baselines, delivering speedups of up to 2.12× and 6.77× compared to original MPI collectives in single-thread and multi-thread modes, respectively, while maintaining data accuracy.
KW - Collective Communication
KW - Distributed Computing
KW - Homomorphic Compression
KW - Parallel Algorithm
UR - https://www.scopus.com/pages/publications/85215000099
UR - https://www.scopus.com/inward/citedby.url?scp=85215000099&partnerID=8YFLogxK
U2 - 10.1109/SC41406.2024.00110
DO - 10.1109/SC41406.2024.00110
M3 - Conference contribution
AN - SCOPUS:85215000099
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2024
Y2 - 17 November 2024 through 22 November 2024
ER -