POSTER: Optimizing Collective Communications with Error-bounded Lossy Compression for GPU Clusters

  • Jiajun Huang
  • Sheng Di
  • Xiaodong Yu
  • Yujia Zhai
  • Jinyang Liu
  • Yafan Huang
  • Ken Raffenetti
  • Hui Zhou
  • Kai Zhao
  • Zizhong Chen
  • Franck Cappello
  • Yanfei Guo
  • Rajeev Thakur

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, but they still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose GPU-LCC, a general framework that designs and optimizes GPU-aware, compression-enabled collectives with well-controlled error propagation. To validate our framework, we evaluate its performance on up to 64 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our GPU-LCC-accelerated collective computation (Allreduce) can outperform NCCL and Cray MPI by up to 3.4× and 18.7×, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
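As a rough illustration of what "error-bounded" and "error propagation" mean for a compressed Allreduce, the sketch below uses plain uniform quantization on the CPU. It is not the GPU-LCC algorithm or its compressor; the function names (`compress`, `decompress`, `simulated_allreduce_sum`), the error bound `eb`, and the rank count are all assumptions chosen for the example. It shows that if each rank's data is reconstructed to within `eb` per value, the element-wise sum over `ranks` buffers deviates from the exact sum by at most `ranks * eb`.

```python
import random

def compress(values, eb):
    """Quantize each value to an integer code using bins of width 2*eb."""
    return [round(v / (2 * eb)) for v in values]

def decompress(codes, eb):
    """Map each integer code back to the centre of its bin."""
    return [c * 2 * eb for c in codes]

def simulated_allreduce_sum(rank_buffers, eb):
    """Element-wise sum after every rank's buffer went through one
    compress/decompress round trip (a stand-in for sending compressed data)."""
    restored = [decompress(compress(buf, eb), eb) for buf in rank_buffers]
    return [sum(column) for column in zip(*restored)]

random.seed(0)
eb = 1e-3      # hypothetical absolute error bound per value
ranks = 8      # hypothetical number of participating ranks
buffers = [[random.uniform(-1.0, 1.0) for _ in range(1000)] for _ in range(ranks)]

exact = [sum(column) for column in zip(*buffers)]
approx = simulated_allreduce_sum(buffers, eb)

# Largest per-value reconstruction error on any single rank.
max_point_err = max(
    abs(a - b)
    for buf in buffers
    for a, b in zip(buf, decompress(compress(buf, eb), eb))
)
# Largest error in the reduced result.
max_sum_err = max(abs(a - b) for a, b in zip(exact, approx))

print(f"per-value error: {max_point_err:.3e} (bound {eb:.3e})")
print(f"reduced error:   {max_sum_err:.3e} (bound {ranks * eb:.3e})")
```

The `ranks * eb` figure is a worst-case bound for this single round trip per rank. Schemes that recompress intermediate results at each step of the collective accumulate error differently, which is why controlling propagation matters.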

Original language: English
Title of host publication: PPoPP 2024 - Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
Pages: 454-456
Number of pages: 3
ISBN (Electronic): 9798400704352
DOIs
State: Published - 20 Feb 2024
Event: 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2024 - Edinburgh, United Kingdom
Duration: 2 Mar 2024 → 6 Mar 2024

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
ISSN (Print): 1542-0205

Conference

Conference: 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2024
Country/Territory: United Kingdom
City: Edinburgh
Period: 2/03/24 → 6/03/24

Keywords

  • Collective Communication
  • Compression
  • GPU
