TY - GEN
T1 - gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
T2 - 38th ACM International Conference on Supercomputing, ICS 2024
AU - Huang, Jiajun
AU - Di, Sheng
AU - Yu, Xiaodong
AU - Zhai, Yujia
AU - Liu, Jinyang
AU - Huang, Yafan
AU - Raffenetti, Ken
AU - Zhou, Hui
AU - Zhao, Kai
AU - Lu, Xiaoyi
AU - Chen, Zizhong
AU - Cappello, Franck
AU - Guo, Yanfei
AU - Thakur, Rajeev
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/5/30
Y1 - 2024/5/30
N2 - GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, in this paper we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL and Cray MPI by up to 4.5× and 28.7×, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
AB - GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, in this paper we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL and Cray MPI by up to 4.5× and 28.7×, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
KW - Collective Communication
KW - Compression
KW - GPU
UR - http://www.scopus.com/inward/record.url?scp=85196307802&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85196307802&partnerID=8YFLogxK
U2 - 10.1145/3650200.3656636
DO - 10.1145/3650200.3656636
M3 - Conference contribution
AN - SCOPUS:85196307802
T3 - Proceedings of the International Conference on Supercomputing
SP - 437
EP - 448
BT - ICS 2024 - Proceedings of the 38th ACM International Conference on Supercomputing
Y2 - 4 June 2024 through 7 June 2024
ER -