TY - GEN
T1 - An Automatic and Efficient BERT Pruning for Edge AI Systems
AU - Huang, Shaoyi
AU - Liu, Ning
AU - Liang, Yueying
AU - Peng, Hongwu
AU - Li, Hongjia
AU - Xu, Dongkuan
AU - Xie, Mimi
AU - Ding, Caiwen
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - With the growing demand for deep learning democratization, there is an increasing need to deploy Transformer-based natural language processing (NLP) models on resource-constrained devices with low latency and high accuracy. Existing BERT pruning methods require domain experts to heuristically handcraft hyperparameters to strike a balance among model size, latency, and accuracy. In this work, we propose AE-BERT, an automatic and efficient BERT pruning framework with an efficient evaluation scheme that selects a "good" sub-network candidate (with high accuracy) given an overall pruning ratio constraint. The proposed method requires no human expert experience and achieves better accuracy on many NLP tasks. Experimental results on the General Language Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms state-of-the-art (SOTA) hand-crafted pruning methods on BERT. On QNLI and RTE, we obtain 75% and 42.8% higher overall pruning ratios, respectively, while achieving higher accuracy. On MRPC, we obtain a 4.6-point higher score than the SOTA at the same overall pruning ratio of 0.5. On STS-B, we achieve a 40% higher pruning ratio with a negligible loss in Spearman correlation compared to SOTA hand-crafted pruning methods. Experimental results also show that, after model compression, the inference of a single BERTBASE encoder on a Xilinx Alveo U200 FPGA board is 1.83× faster than on an Intel(R) Xeon(R) Gold 5218 (2.30 GHz) CPU, demonstrating the feasibility of deploying the BERTBASE sub-networks generated by the proposed method on computation-restricted devices.
AB - With the growing demand for deep learning democratization, there is an increasing need to deploy Transformer-based natural language processing (NLP) models on resource-constrained devices with low latency and high accuracy. Existing BERT pruning methods require domain experts to heuristically handcraft hyperparameters to strike a balance among model size, latency, and accuracy. In this work, we propose AE-BERT, an automatic and efficient BERT pruning framework with an efficient evaluation scheme that selects a "good" sub-network candidate (with high accuracy) given an overall pruning ratio constraint. The proposed method requires no human expert experience and achieves better accuracy on many NLP tasks. Experimental results on the General Language Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms state-of-the-art (SOTA) hand-crafted pruning methods on BERT. On QNLI and RTE, we obtain 75% and 42.8% higher overall pruning ratios, respectively, while achieving higher accuracy. On MRPC, we obtain a 4.6-point higher score than the SOTA at the same overall pruning ratio of 0.5. On STS-B, we achieve a 40% higher pruning ratio with a negligible loss in Spearman correlation compared to SOTA hand-crafted pruning methods. Experimental results also show that, after model compression, the inference of a single BERTBASE encoder on a Xilinx Alveo U200 FPGA board is 1.83× faster than on an Intel(R) Xeon(R) Gold 5218 (2.30 GHz) CPU, demonstrating the feasibility of deploying the BERTBASE sub-networks generated by the proposed method on computation-restricted devices.
KW - acceleration
KW - deep learning
KW - pruning
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85133792443&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85133792443&partnerID=8YFLogxK
U2 - 10.1109/ISQED54688.2022.9806197
DO - 10.1109/ISQED54688.2022.9806197
M3 - Conference contribution
AN - SCOPUS:85133792443
T3 - Proceedings - International Symposium on Quality Electronic Design, ISQED
BT - Proceedings of the 23rd International Symposium on Quality Electronic Design, ISQED 2022
T2 - 23rd International Symposium on Quality Electronic Design, ISQED 2022
Y2 - 6 April 2022 through 7 April 2022
ER -