TY - GEN
T1 - POSTER
T2 - 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2024
AU - Huang, Jiajun
AU - Di, Sheng
AU - Yu, Xiaodong
AU - Zhai, Yujia
AU - Liu, Jinyang
AU - Huang, Yafan
AU - Raffenetti, Ken
AU - Zhou, Hui
AU - Zhao, Kai
AU - Chen, Zizhong
AU - Cappello, Franck
AU - Guo, Yanfei
AU - Thakur, Rajeev
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/3/2
AB - GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. Traditional approaches address this issue by integrating lossy compression directly into GPU-aware collectives, but they still suffer from serious problems such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose GPU-LCC, a general framework that designs and optimizes GPU-aware, compression-enabled collectives with well-controlled error propagation. To validate our framework, we evaluate its performance on up to 64 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our GPU-LCC-accelerated collective computation (Allreduce) can outperform NCCL and Cray MPI by up to 3.4× and 18.7×, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high quality of the data reconstructed by our accuracy-aware framework.
KW - Collective Communication
KW - Compression
KW - GPU
UR - http://www.scopus.com/inward/record.url?scp=85187205534&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85187205534&partnerID=8YFLogxK
DO - 10.1145/3627535.3638467
M3 - Conference contribution
AN - SCOPUS:85187205534
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP
SP - 454
EP - 456
BT - PPoPP 2024 - Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
Y2 - 2 March 2024 through 6 March 2024
ER -