TY - GEN
T1 - Towards Lossless Head Pruning through Automatic Peer Distillation for Language Models
AU - Li, Bingbing
AU - Wang, Zigeng
AU - Huang, Shaoyi
AU - Bragin, Mikhail
AU - Li, Ji
AU - Ding, Caiwen
N1 - Publisher Copyright:
© 2023 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Pruning has been extensively studied in Transformer-based language models to improve efficiency. Typically, unimportant model weights are zeroed out (pruned) and the resulting compact model is retrained to recover accuracy. The pruned weights are treated as useless and discarded, which usually leads to significant accuracy degradation. In this paper, we focus on attention head pruning, since multi-head attention is a key component of Transformer-based language models and offers interpretable knowledge. We reveal the relationship between pruned and retained attention heads and propose peer distillation, a method that recycles the discarded knowledge from the pruned heads. We also develop an automatic framework that locates the to-be-pruned attention heads in each layer, eliminating the time-consuming manual tuning of hyperparameters. Experimental results on the General Language Understanding Evaluation (GLUE) benchmark using the BERT model show that, by recycling discarded knowledge from pruned heads, the proposed method maintains model performance across all nine tasks while reducing heads by over 58% on average, outperforming state-of-the-art techniques (e.g., Random, HISP, L0 Norm, SMP).
AB - Pruning has been extensively studied in Transformer-based language models to improve efficiency. Typically, unimportant model weights are zeroed out (pruned) and the resulting compact model is retrained to recover accuracy. The pruned weights are treated as useless and discarded, which usually leads to significant accuracy degradation. In this paper, we focus on attention head pruning, since multi-head attention is a key component of Transformer-based language models and offers interpretable knowledge. We reveal the relationship between pruned and retained attention heads and propose peer distillation, a method that recycles the discarded knowledge from the pruned heads. We also develop an automatic framework that locates the to-be-pruned attention heads in each layer, eliminating the time-consuming manual tuning of hyperparameters. Experimental results on the General Language Understanding Evaluation (GLUE) benchmark using the BERT model show that, by recycling discarded knowledge from pruned heads, the proposed method maintains model performance across all nine tasks while reducing heads by over 58% on average, outperforming state-of-the-art techniques (e.g., Random, HISP, L0 Norm, SMP).
UR - http://www.scopus.com/inward/record.url?scp=85170398973&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85170398973&partnerID=8YFLogxK
U2 - 10.24963/ijcai.2023/568
DO - 10.24963/ijcai.2023/568
M3 - Conference contribution
AN - SCOPUS:85170398973
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 5113
EP - 5121
BT - Proceedings of the 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023
A2 - Elkind, Edith
T2 - 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023
Y2 - 19 August 2023 through 25 August 2023
ER -