POSTER: Optimizing Collective Communications with Error-bounded Lossy Compression for GPU Clusters

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Scopus citation

Abstract

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, but they still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose GPU-LCC, a general framework that designs and optimizes GPU-aware, compression-enabled collectives with well-controlled error propagation. To validate our framework, we evaluate its performance on up to 64 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our GPU-LCC-accelerated collective computation (Allreduce) can outperform NCCL and Cray MPI by up to 3.4× and 18.7×, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.

Original language: English
Title of host publication: PPoPP 2024 - Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
Pages: 454-456
Number of pages: 3
ISBN (Electronic): 9798400704352
State: Published - 2 Mar 2024
Event: 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2024 - Edinburgh, United Kingdom
Duration: 2 Mar 2024 → 6 Mar 2024

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

Conference

Conference: 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2024
Country/Territory: United Kingdom
City: Edinburgh
Period: 2/03/24 → 6/03/24

Keywords

  • Collective Communication
  • Compression
  • GPU
