TY - GEN
T1 - ALinFiK
T2 - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025
AU - Pan, Yanzhou
AU - Lin, Huawei
AU - Ran, Yide
AU - Chen, Jiamin
AU - Yu, Xiaodong
AU - Zhao, Weijie
AU - Zhang, Denghui
AU - Xu, Zhaozhuo
N1 - Publisher Copyright: © 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
AB - Large Language Models (LLMs) heavily rely on high-quality training data, making data valuation crucial for optimizing model performance, especially when working within a limited budget. In this work, we aim to offer a third-party data valuation approach that benefits both data providers and model developers. We introduce a linearized future influence kernel (LinFiK), which assesses the value of individual data samples in improving LLM performance during training. We further propose ALinFiK, a learning strategy to approximate LinFiK, enabling scalable data valuation. Our comprehensive evaluations demonstrate that this approach surpasses existing baselines in effectiveness and efficiency, demonstrating significant scalability advantages as LLM parameters increase.
UR - https://www.scopus.com/pages/publications/105027366788
DO - 10.18653/v1/2025.naacl-long.589
M3 - Conference contribution
AN - SCOPUS:105027366788
T3 - Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025
SP - 11756
EP - 11771
BT - Long Papers
A2 - Chiruzzo, Luis
A2 - Ritter, Alan
A2 - Wang, Lu
Y2 - 29 April 2025 through 4 May 2025
ER -