TY - GEN
T1 - Accommodating Transformer onto FPGA: Coupling the Balanced Model Compression and FPGA-Implementation Optimization
T2 - 31st Great Lakes Symposium on VLSI, GLSVLSI 2021
AU - Qi, Panjie
AU - Song, Yuhong
AU - Peng, Hongwu
AU - Huang, Shaoyi
AU - Zhuge, Qingfeng
AU - Sha, Edwin Hsing Mean
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/6/22
Y1 - 2021/6/22
N2 - Recently, Transformers have gradually gained popularity and perform outstandingly on many Natural Language Processing (NLP) tasks. However, Transformers suffer from heavy computation and a large memory footprint, making them difficult to deploy on embedded devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms owing to its advantages in flexibility and energy efficiency. However, trained Transformer models are too large to fit onto an FPGA fabric. To accommodate the Transformer onto an FPGA and achieve efficient execution, we propose an acceleration framework that couples balanced model compression at the algorithm level with FPGA-implementation optimization at the hardware level. At the algorithm level, we adopt block-balanced pruning and propose an efficient sparse matrix storage format for this pruning technique, named Compressed Block Row (CBR). At the hardware level, we design an accelerator for the sparse model and abstract a performance analytic model to evaluate the accelerator's performance. Experiments show that our CBR format performs better than general formats and significantly saves storage space, and our accelerator achieves $38\times$ and $1.93\times$ speedups compared to other works on CPU and GPU, respectively.
AB - Recently, Transformers have gradually gained popularity and perform outstandingly on many Natural Language Processing (NLP) tasks. However, Transformers suffer from heavy computation and a large memory footprint, making them difficult to deploy on embedded devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms owing to its advantages in flexibility and energy efficiency. However, trained Transformer models are too large to fit onto an FPGA fabric. To accommodate the Transformer onto an FPGA and achieve efficient execution, we propose an acceleration framework that couples balanced model compression at the algorithm level with FPGA-implementation optimization at the hardware level. At the algorithm level, we adopt block-balanced pruning and propose an efficient sparse matrix storage format for this pruning technique, named Compressed Block Row (CBR). At the hardware level, we design an accelerator for the sparse model and abstract a performance analytic model to evaluate the accelerator's performance. Experiments show that our CBR format performs better than general formats and significantly saves storage space, and our accelerator achieves $38\times$ and $1.93\times$ speedups compared to other works on CPU and GPU, respectively.
KW - fpga
KW - model compression
KW - nlp
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85109210883&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85109210883&partnerID=8YFLogxK
U2 - 10.1145/3453688.3461739
DO - 10.1145/3453688.3461739
M3 - Conference contribution
AN - SCOPUS:85109210883
T3 - Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI
SP - 163
EP - 168
BT - GLSVLSI 2021 - Proceedings of the 2021 Great Lakes Symposium on VLSI
Y2 - 22 June 2021 through 25 June 2021
ER -