TY - GEN
T1 - HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU
T2 - 31st Great Lakes Symposium on VLSI, GLSVLSI 2021
AU - Huang, Shaoyi
AU - Chen, Shiyang
AU - Peng, Hongwu
AU - Manu, Daniel
AU - Kong, Zhenglun
AU - Yuan, Geng
AU - Yang, Lei
AU - Wang, Shusen
AU - Liu, Hang
AU - Ding, Caiwen
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/6/22
Y1 - 2021/6/22
AB - Although Transformer-based deep learning models have been widely used in many natural language processing (NLP) tasks as well as computer vision, they suffer from gigantic model size and long latency. Network pruning can reduce the computational cost and model size. However, existing works mainly focus on irregular (sparse) pruning, which often causes irregular computations and requires extra indices per remaining weight. In this work, we propose a Tensor-core inspired hierarchical model compression method to push the performance limit on modern GPUs. We present two modes of the two-step process. In the first mode, we use a Tensor-core aware block-based weight pruning method to exploit model sparsity in a coarse-grained manner and then use low-rank decomposition to further reduce the weight storage in a fine-grained manner. In the second mode, we first use irregular pruning to achieve a highly sparse model and then apply a Tensor-core aware weight constraint on the sparse model to decompose the sparse matrix into several smaller but Tensor-core friendly sub-matrices. Experiments on Transformer and BERTBASE models show that the proposed method outperforms the state-of-the-art.
KW - bert
KW - block weight pruning
KW - low-rank
KW - tensor-core
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85109209660&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85109209660&partnerID=8YFLogxK
U2 - 10.1145/3453688.3461740
DO - 10.1145/3453688.3461740
M3 - Conference contribution
AN - SCOPUS:85109209660
T3 - Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI
SP - 169
EP - 174
BT - GLSVLSI 2021 - Proceedings of the 2021 Great Lakes Symposium on VLSI
Y2 - 22 June 2021 through 25 June 2021
ER -