TY - GEN
T1 - COMPSO
T2 - 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2025
AU - Sun, Baixi
AU - Liu, Weijin
AU - Pauloski, J. Gregory
AU - Tian, Jiannan
AU - Jia, Jinda
AU - Wang, Daoce
AU - Zhang, Boyuan
AU - Zheng, Mingkai
AU - Di, Sheng
AU - Jin, Sian
AU - Zhang, Zhao
AU - Yu, Xiaodong
AU - Iskra, Kamil A.
AU - Beckman, Pete
AU - Tan, Guangming
AU - Tao, Dingwen
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/2/28
Y1 - 2025/2/28
AB - Second-order optimization methods have been developed to enhance convergence and generalization in deep neural network (DNN) training compared to first-order methods like Stochastic Gradient Descent (SGD). However, these methods face challenges in distributed settings due to high communication overhead. Gradient compression, a technique commonly used to accelerate communication for first-order approaches, often results in low communication reduction ratios, decreased model accuracy, and/or high compression overhead when applied to second-order methods. To address these limitations, we introduce a novel gradient compression method for second-order optimizers called COMPSO. This method effectively reduces communication costs while preserving the advantages of second-order optimization. COMPSO employs stochastic rounding to maintain accuracy and filters out minor gradients to improve compression ratios. Additionally, we develop GPU optimizations to minimize compression overhead and performance modeling to ensure end-to-end performance gains across various systems. Evaluation of COMPSO on different DNN models shows that it achieves a compression ratio of 22.1×, reduces communication time by 14.2×, and improves overall performance by 1.9×, all without any drop in model accuracy.
KW - data compression
KW - deep learning
KW - distributed training
KW - K-FAC
KW - second-order optimization
UR - http://www.scopus.com/inward/record.url?scp=105000373525&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105000373525&partnerID=8YFLogxK
U2 - 10.1145/3710848.3710852
DO - 10.1145/3710848.3710852
M3 - Conference contribution
AN - SCOPUS:105000373525
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP
SP - 212
EP - 224
BT - PPoPP 2025 - Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
Y2 - 1 March 2025 through 5 March 2025
ER -
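
The abstract above names two concrete techniques: stochastic rounding and filtering of small-magnitude gradients. Below is a minimal, generic NumPy sketch of those two ideas for orientation only; it is not the authors' COMPSO implementation, and the function names, rounding-grid spacing, and threshold value are illustrative assumptions.

import numpy as np

def stochastic_round(x, scale, rng):
    # Round x onto a grid of spacing `scale`, rounding up with probability
    # equal to the fractional part, so the result is unbiased in expectation.
    y = x / scale
    lo = np.floor(y)
    frac = y - lo
    up = rng.random(x.shape) < frac
    return (lo + up) * scale

def filter_small(x, threshold):
    # Drop entries below `threshold` in magnitude; return the kept values,
    # their flat indices, and the fraction of entries retained.
    mask = np.abs(x) >= threshold
    return x[mask], np.flatnonzero(mask), mask.mean()

rng = np.random.default_rng(0)
grad = rng.normal(0.0, 1e-2, size=1_000_000).astype(np.float32)

kept, idx, kept_frac = filter_small(grad, threshold=5e-3)   # hypothetical threshold
quantized = stochastic_round(kept, scale=1e-3, rng=rng)     # hypothetical grid spacing

print(f"kept {kept_frac:.1%} of entries, "
      f"mean rounding error {np.mean(quantized - kept):+.2e}")

Because the rounding is unbiased in expectation, its error has zero mean, which is the standard reason stochastic rounding tends to preserve training accuracy better than round-to-nearest at low precision; how COMPSO combines these steps in detail is described in the paper itself.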