TY - GEN
T1 - E.T.: rE-Thinking Self-Attention for Transformer Models on GPUs
T2 - 33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021
AU - Chen, Shiyang
AU - Huang, Shaoyi
AU - Pandey, Santosh
AU - Li, Bingbing
AU - Gao, Guang R.
AU - Zheng, Long
AU - Ding, Caiwen
AU - Liu, Hang
N1 - Publisher Copyright:
© 2021 IEEE Computer Society. All rights reserved.
PY - 2021/11/14
Y1 - 2021/11/14
N2 - Transformer-based deep learning models have become a ubiquitous vehicle to drive a variety of Natural Language Processing (NLP) tasks beyond their accuracy ceiling. However, these models also suffer from two pronounced challenges, namely, gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which rE-Thinks self-attention computation for Transformer models on GPUs, with the following contributions: First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence-length-aware optimizations, and operation reordering optimizations. Second, we present an attention-aware pruning design which judiciously uses various pruning algorithms to reduce more computation, hence achieving significantly shorter turnaround time. For the pruning algorithms, we not only revamp the existing pruning algorithms, but also tailor new ones for Transformer models. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERT-Base, and DistilBERT, where E.T. presents superior performance over mainstream projects, including the popular NVIDIA enterprise solutions, i.e., TensorRT and FasterTransformer.
AB - Transformer-based deep learning models have become a ubiquitous vehicle to drive a variety of Natural Language Processing (NLP) tasks beyond their accuracy ceiling. However, these models also suffer from two pronounced challenges, namely, gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which rE-Thinks self-attention computation for Transformer models on GPUs, with the following contributions: First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence-length-aware optimizations, and operation reordering optimizations. Second, we present an attention-aware pruning design which judiciously uses various pruning algorithms to reduce more computation, hence achieving significantly shorter turnaround time. For the pruning algorithms, we not only revamp the existing pruning algorithms, but also tailor new ones for Transformer models. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERT-Base, and DistilBERT, where E.T. presents superior performance over mainstream projects, including the popular NVIDIA enterprise solutions, i.e., TensorRT and FasterTransformer.
UR - http://www.scopus.com/inward/record.url?scp=85119966777&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119966777&partnerID=8YFLogxK
U2 - 10.1145/3458817.3476138
DO - 10.1145/3458817.3476138
M3 - Conference contribution
AN - SCOPUS:85119966777
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2021
Y2 - 14 November 2021 through 19 November 2021
ER -