TY - JOUR
T1 - DRAGONN: Distributed Randomized Approximate Gradients of Neural Networks
T2 - 39th International Conference on Machine Learning, ICML 2022
AU - Wang, Zhuang
AU - Xu, Zhaozhuo
AU - Wu, Xinyu Crystal
AU - Shrivastava, Anshumali
AU - Ng, T. S. Eugene
N1 - Publisher Copyright:
Copyright © 2022 by the author(s)
PY - 2022
Y1 - 2022
N2 - Data-parallel distributed training (DDT) has become the de-facto standard for accelerating the training of most deep learning tasks on massively parallel hardware. In the DDT paradigm, the communication overhead of gradient synchronization is the major efficiency bottleneck. A widely adopted approach to tackle this issue is gradient sparsification (GS). However, the current GS methods introduce significant new overhead in compressing the gradients, outweighing the communication overhead and becoming the new efficiency bottleneck. In this paper, we propose DRAGONN, a randomized hashing algorithm for GS in DDT. DRAGONN can significantly reduce the compression time by up to 70% compared to state-of-the-art GS approaches, and achieve up to 3.52× speedup in total training throughput.
AB - Data-parallel distributed training (DDT) has become the de-facto standard for accelerating the training of most deep learning tasks on massively parallel hardware. In the DDT paradigm, the communication overhead of gradient synchronization is the major efficiency bottleneck. A widely adopted approach to tackle this issue is gradient sparsification (GS). However, the current GS methods introduce significant new overhead in compressing the gradients, outweighing the communication overhead and becoming the new efficiency bottleneck. In this paper, we propose DRAGONN, a randomized hashing algorithm for GS in DDT. DRAGONN can significantly reduce the compression time by up to 70% compared to state-of-the-art GS approaches, and achieve up to 3.52× speedup in total training throughput.
UR - http://www.scopus.com/inward/record.url?scp=85162758302&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85162758302&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85162758302
VL - 162
SP - 23274
EP - 23291
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 17 July 2022 through 23 July 2022
ER -