TY - GEN
T1 - VulLibGen
T2 - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
AU - Chen, Tianyu
AU - Li, Lin
AU - Zhu, Liuchuan
AU - Li, Zongyang
AU - Liu, Xueqing
AU - Liang, Guangtai
AU - Wang, Qianxiang
AU - Xie, Tao
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Security practitioners maintain vulnerability reports (e.g., GitHub Advisory) to help developers mitigate security risks. An important task for these databases is automatically extracting structured information mentioned in the report, e.g., the affected software packages, to accelerate the defense of the vulnerability ecosystem. However, it is challenging for existing work on affected package identification to achieve high precision. One reason is that all existing work focuses on relatively small models and thus cannot harness the knowledge and semantic capabilities of large language models. To address this limitation, we propose VulLibGen, the first method to use an LLM for affected package identification. In contrast to existing work, VulLibGen introduces the novel idea of directly generating the affected package. To improve precision, VulLibGen employs supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and a local search algorithm. The local search algorithm is a novel post-processing algorithm we introduce to reduce hallucination in the generated packages. Our evaluation results show that VulLibGen achieves an average precision of 0.806 for identifying vulnerable packages in the four most popular ecosystems in GitHub Advisory (Java, JS, Python, Go), while the best average precision in previous work is 0.721. Additionally, VulLibGen has high value to security practice: we submitted 60 pairs to GitHub Advisory (covering all four ecosystems), and 34 of them have been accepted and merged.
AB - Security practitioners maintain vulnerability reports (e.g., GitHub Advisory) to help developers mitigate security risks. An important task for these databases is automatically extracting structured information mentioned in the report, e.g., the affected software packages, to accelerate the defense of the vulnerability ecosystem. However, it is challenging for existing work on affected package identification to achieve high precision. One reason is that all existing work focuses on relatively small models and thus cannot harness the knowledge and semantic capabilities of large language models. To address this limitation, we propose VulLibGen, the first method to use an LLM for affected package identification. In contrast to existing work, VulLibGen introduces the novel idea of directly generating the affected package. To improve precision, VulLibGen employs supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and a local search algorithm. The local search algorithm is a novel post-processing algorithm we introduce to reduce hallucination in the generated packages. Our evaluation results show that VulLibGen achieves an average precision of 0.806 for identifying vulnerable packages in the four most popular ecosystems in GitHub Advisory (Java, JS, Python, Go), while the best average precision in previous work is 0.721. Additionally, VulLibGen has high value to security practice: we submitted 60 pairs to GitHub Advisory (covering all four ecosystems), and 34 of them have been accepted and merged.
UR - http://www.scopus.com/inward/record.url?scp=85204442295&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85204442295&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.acl-long.527
DO - 10.18653/v1/2024.acl-long.527
M3 - Conference contribution
AN - SCOPUS:85204442295
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 9767
EP - 9780
BT - Long Papers
A2 - Ku, Lun-Wei
A2 - Martins, Andre F. T.
A2 - Srikumar, Vivek
Y2 - 11 August 2024 through 16 August 2024
ER -