TY - GEN
T1 - Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-trained Language Models
AU - Yang, Guanqun
AU - Dineen, Shay
AU - Lin, Zhipeng
AU - Liu, Xueqing
N1 - Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Public security vulnerability reports (e.g., CVE reports) play an important role in the maintenance of computer and network systems. Security companies and administrators rely on information from these reports to prioritize tasks for developing and deploying patches to their customers. Since these reports are unstructured texts, automatic information extraction (IE) can help scale up the processing by converting the unstructured reports into structured forms, e.g., software names and versions [8] and vulnerability types [38]. Existing work on automated IE for security vulnerability reports often relies on a large number of labeled training samples [8, 18, 48]. However, creating a massive labeled training set is both expensive and time-consuming. In this work, for the first time, we investigate this problem in the setting where only a small number of labeled training samples are available. In particular, we study the performance of fine-tuning several state-of-the-art pre-trained language models on our small training dataset. The results show that with pre-trained language models and carefully tuned hyperparameters, we match or slightly outperform the state-of-the-art system [8] on this task. Following the two-step process of first fine-tuning on the main category and then transferring to the other categories, as in [7], our approach substantially decreases the number of labeled samples required in both stages: a 90% reduction in fine-tuning (from 5758 to 576 samples) and an 88.8% reduction in transfer learning (64 labeled samples per category). Our experiments thus demonstrate the effectiveness of few-sample learning for NER on security vulnerability reports. This result opens up multiple research opportunities for few-sample learning for security vulnerability reports, which are discussed in the paper. 
Our implementation of the few-sample vulnerability entity tagger for security reports can be found at https://github.com/guanqun-yang/FewVulnerability.
AB - Public security vulnerability reports (e.g., CVE reports) play an important role in the maintenance of computer and network systems. Security companies and administrators rely on information from these reports to prioritize tasks for developing and deploying patches to their customers. Since these reports are unstructured texts, automatic information extraction (IE) can help scale up the processing by converting the unstructured reports into structured forms, e.g., software names and versions [8] and vulnerability types [38]. Existing work on automated IE for security vulnerability reports often relies on a large number of labeled training samples [8, 18, 48]. However, creating a massive labeled training set is both expensive and time-consuming. In this work, for the first time, we investigate this problem in the setting where only a small number of labeled training samples are available. In particular, we study the performance of fine-tuning several state-of-the-art pre-trained language models on our small training dataset. The results show that with pre-trained language models and carefully tuned hyperparameters, we match or slightly outperform the state-of-the-art system [8] on this task. Following the two-step process of first fine-tuning on the main category and then transferring to the other categories, as in [7], our approach substantially decreases the number of labeled samples required in both stages: a 90% reduction in fine-tuning (from 5758 to 576 samples) and an 88.8% reduction in transfer learning (64 labeled samples per category). Our experiments thus demonstrate the effectiveness of few-sample learning for NER on security vulnerability reports. This result opens up multiple research opportunities for few-sample learning for security vulnerability reports, which are discussed in the paper. 
Our implementation of the few-sample vulnerability entity tagger for security reports can be found at https://github.com/guanqun-yang/FewVulnerability.
KW - Few-sample named entity recognition
KW - Public security reports
KW - Software vulnerability identification
UR - http://www.scopus.com/inward/record.url?scp=85116366721&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85116366721&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-87839-9_3
DO - 10.1007/978-3-030-87839-9_3
M3 - Conference contribution
AN - SCOPUS:85116366721
SN - 9783030878382
T3 - Communications in Computer and Information Science
SP - 55
EP - 78
BT - Deployable Machine Learning for Security Defense - 2nd International Workshop, MLHat 2021, Proceedings
A2 - Wang, Gang
A2 - Ciptadi, Arridhana
A2 - Ahmadzadeh, Ali
T2 - 2nd International Workshop on Deployable Machine Learning for Security Defense, MLHat 2021, co-located with 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2021
Y2 - 15 August 2021 through 15 August 2021
ER -