TY - GEN
T1 - Accelerating Transformer-based deep learning models on FPGAs using column balanced block pruning
AU - Peng, Hongwu
AU - Huang, Shaoyi
AU - Geng, Tong
AU - Li, Ang
AU - Jiang, Weiwen
AU - Liu, Hang
AU - Wang, Shusen
AU - Ding, Caiwen
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/4/7
Y1 - 2021/4/7
N2 - Although Transformer-based language representations achieve state-of-the-art accuracy on various natural language processing (NLP) tasks, the large model size poses challenges for resource-constrained computing platforms. Weight pruning, a popular and effective technique for reducing the number of weight parameters and accelerating the Transformer, has been investigated on GPUs. However, Transformer acceleration using weight pruning on field-programmable gate arrays (FPGAs) remains unexplored. This paper investigates column balanced block-wise pruning on the Transformer and designs an FPGA acceleration engine customized for the balanced block-wise matrix multiplication. We implement the Transformer model with proper hardware scheduling, and experiments show that Transformer inference on the FPGA achieves a latency of 10.35 ms with a batch size of 32, a $10.96\times$ speedup compared to the CPU platform and a $2.08\times$ speedup compared to the GPU platform.
AB - Although Transformer-based language representations achieve state-of-the-art accuracy on various natural language processing (NLP) tasks, the large model size poses challenges for resource-constrained computing platforms. Weight pruning, a popular and effective technique for reducing the number of weight parameters and accelerating the Transformer, has been investigated on GPUs. However, Transformer acceleration using weight pruning on field-programmable gate arrays (FPGAs) remains unexplored. This paper investigates column balanced block-wise pruning on the Transformer and designs an FPGA acceleration engine customized for the balanced block-wise matrix multiplication. We implement the Transformer model with proper hardware scheduling, and experiments show that Transformer inference on the FPGA achieves a latency of 10.35 ms with a batch size of 32, a $10.96\times$ speedup compared to the CPU platform and a $2.08\times$ speedup compared to the GPU platform.
KW - Acceleration
KW - Deep learning
KW - FPGA
KW - Pruning
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85105997156&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85105997156&partnerID=8YFLogxK
U2 - 10.1109/ISQED51717.2021.9424344
DO - 10.1109/ISQED51717.2021.9424344
M3 - Conference contribution
AN - SCOPUS:85105997156
T3 - Proceedings - International Symposium on Quality Electronic Design, ISQED
SP - 142
EP - 148
BT - Proceedings of the 22nd International Symposium on Quality Electronic Design, ISQED 2021
T2 - 22nd International Symposium on Quality Electronic Design, ISQED 2021
Y2 - 7 April 2021 through 9 April 2021
ER -