gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to integrate lossy compression directly into GPU-aware collectives, which can lead to serious issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, covering both collective computation (Allreduce) and collective data movement (Scatter), outperform NCCL and Cray MPI by up to 4.5× and 28.7×, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
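The abstract contrasts gZCCL with the straightforward approach of bolting lossy compression directly onto a GPU-aware collective. The sketch below (C with MPI) illustrates that naive pattern for an Allreduce: compress the local buffer, gather every rank's compressed bytes, then decompress and reduce locally. The gpu_lossy_compress / gpu_lossy_decompress hooks are hypothetical placeholders (implemented as pass-through copies so the sketch runs), not the gZCCL or cuSZ API, and the allgather-based structure is a simplification; it is exactly the kind of design that leaves GPUs idle during communication, which gZCCL's optimized, accuracy-aware collectives are built to avoid.

```c
/*
 * Minimal sketch of the "directly integrate compression into a collective"
 * pattern the abstract critiques: compress locally, move compressed bytes,
 * then decompress and reduce. Built as a naive allgather-based Allreduce
 * over MPI. The gpu_lossy_compress/_decompress hooks are hypothetical
 * placeholders (pass-through copies so the sketch runs), not the gZCCL or
 * cuSZ API; a real design would pipeline compression kernels with
 * communication to keep the GPU busy.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder compressor: copies the input and reports its byte size. */
static size_t gpu_lossy_compress(const float *in, size_t n, double err_bound,
                                 unsigned char *out)
{
    (void)err_bound;                  /* a real compressor would honor this */
    memcpy(out, in, n * sizeof(float));
    return n * sizeof(float);
}

/* Placeholder decompressor: copies the compressed bytes back to floats. */
static void gpu_lossy_decompress(const unsigned char *in, size_t nbytes,
                                 size_t n, float *out)
{
    (void)nbytes;
    memcpy(out, in, n * sizeof(float));
}

/* Naive compression-enabled Allreduce (sum): compress, allgather, reduce. */
static void naive_compressed_allreduce(float *data, size_t n,
                                       double err_bound, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    /* 1. Compress the local buffer before any communication.
       (Sketch assumes compressed output never exceeds the input size.) */
    unsigned char *cbuf = malloc(n * sizeof(float));
    int cbytes = (int)gpu_lossy_compress(data, n, err_bound, cbuf);

    /* 2. Exchange compressed sizes, then the compressed payloads. */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    MPI_Allgather(&cbytes, 1, MPI_INT, counts, 1, MPI_INT, comm);
    int total = 0;
    for (int i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }
    unsigned char *recv = malloc(total);
    MPI_Allgatherv(cbuf, cbytes, MPI_BYTE,
                   recv, counts, displs, MPI_BYTE, comm);

    /* 3. Decompress every rank's contribution and reduce locally. */
    float *tmp = malloc(n * sizeof(float));
    memset(data, 0, n * sizeof(float));
    for (int i = 0; i < size; i++) {
        gpu_lossy_decompress(recv + displs[i], (size_t)counts[i], n, tmp);
        for (size_t j = 0; j < n; j++) data[j] += tmp[j];
    }

    free(cbuf); free(counts); free(displs); free(recv); free(tmp);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    float x[4] = {1.f, 2.f, 3.f, 4.f};
    naive_compressed_allreduce(x, 4, 1e-3, MPI_COMM_WORLD);
    printf("x[0] after allreduce: %f\n", x[0]);   /* == 1.0 * world size */
    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run under mpirun; with the pass-through stubs the result simply matches a standard MPI_Allreduce sum, which makes the communication pattern easy to inspect in isolation.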

Original language: English
Title of host publication: ICS 2024 - Proceedings of the 38th ACM International Conference on Supercomputing
Pages: 437-448
Number of pages: 12
ISBN (Electronic): 9798400706103
DOIs
State: Published - 30 May 2024
Event: 38th ACM International Conference on Supercomputing, ICS 2024 - Kyoto, Japan
Duration: 4 Jun 2024 - 7 Jun 2024

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 38th ACM International Conference on Supercomputing, ICS 2024
Country/Territory: Japan
City: Kyoto
Period: 4/06/24 - 7/06/24

Keywords

  • Collective Communication
  • Compression
  • GPU
