gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

  • Jiajun Huang
  • Sheng Di
  • Xiaodong Yu
  • Yujia Zhai
  • Jinyang Liu
  • Yafan Huang
  • Ken Raffenetti
  • Hui Zhou
  • Kai Zhao
  • Xiaoyi Lu
  • Zizhong Chen
  • Franck Cappello
  • Yanfei Guo
  • Rajeev Thakur

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Scopus citations

Abstract

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to integrate lossy compression directly into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL and Cray MPI by up to 4.5× and 28.7×, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
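
As a rough illustration of why error control matters once lossy compression sits inside a collective, the sketch below simulates a sum-Allreduce on the CPU in which each rank's buffer is compressed once with an error-bounded quantizer. It is not gZCCL's algorithm: the uniform quantizer, the function names, and the choice to split a total error budget evenly across ranks are all assumptions made for this example.

```python
import random

def compress(values, error_bound):
    """Quantize to integer codes so that |x - decompress(code)| <= error_bound.
    (Hypothetical stand-in for a real GPU lossy compressor.)"""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]

def decompress(codes, error_bound):
    """Map integer codes back to floating-point values."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

def compressed_allreduce_sum(rank_buffers, total_error_budget):
    """Simulate a sum-Allreduce where every rank's contribution is
    compressed exactly once before being added to the result.

    Assumption: splitting the budget evenly gives each rank a bound of
    total_error_budget / n_ranks, so the summed error stays within budget."""
    n_ranks = len(rank_buffers)
    per_rank_bound = total_error_budget / n_ranks
    result = [0.0] * len(rank_buffers[0])
    for buf in rank_buffers:
        received = decompress(compress(buf, per_rank_bound), per_rank_bound)
        result = [r + x for r, x in zip(result, received)]
    return result

# Four simulated ranks, eight values each.
random.seed(0)
buffers = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]
exact = [sum(column) for column in zip(*buffers)]
approx = compressed_allreduce_sum(buffers, total_error_budget=1e-3)
max_err = max(abs(a - e) for a, e in zip(approx, exact))
print(f"max abs error: {max_err:.2e} (budget 1e-3)")
```

The point of the sketch is only that per-element errors add up across ranks in a reduction, so a fixed per-message bound does not by itself bound the error of the final result; a collective that recompresses intermediate sums at every step would accumulate error differently from this single-compression model.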

Original language: English
Title of host publication: ICS 2024 - Proceedings of the 38th ACM International Conference on Supercomputing
Pages: 437-448
Number of pages: 12
ISBN (Electronic): 9798400706103
State: Published - 30 May 2024
Event: 38th ACM International Conference on Supercomputing, ICS 2024 - Kyoto, Japan
Duration: 4 Jun 2024 → 7 Jun 2024

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 38th ACM International Conference on Supercomputing, ICS 2024
Country/Territory: Japan
City: Kyoto
Period: 4/06/24 → 7/06/24

Keywords

  • Collective Communication
  • Compression
  • GPU
