TY - GEN
T1 - Automated Identification of Machine Learning Technical Debt Code Comments
AU - Khanvilkar, Omkar
AU - Mkaouer, Mohamed Wiem
AU - Alomar, Eman Abdullah
AU - Elsaid, Abdelrahman
AU - Chaaben, Amal
AU - Touati, Mohamed
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - The rapid integration of machine learning (ML) systems within software projects introduces complex maintenance challenges known as Machine Learning Technical Debt (MLTD). Identifying and managing this technical debt is crucial for the sustainability and efficiency of ML systems. This paper presents a novel framework that leverages natural language processing techniques to detect MLTD types using comments written in the source code. Specifically, the spaCy library's textcat_multilabel pipeline is employed to train a multi-label classification model designed to automatically classify code comments into distinct categories of MLTD, such as 'data debt,' 'model debt,' 'configuration debt,' and 'environment debt.' The dataset includes thousands of manually annotated comments from several large-scale ML repositories, offering a diverse and comprehensive basis for training and testing the classifier. The approach is detailed through the processes involved in the preprocessing of the comment text, feature extraction, and the selection of appropriate model parameters. Challenges associated with working with sparse and domain-specific language typical of code comments are also discussed. Evaluation metrics show that the classifier achieves robust accuracy and precision across different types of MLTD, providing developers and project managers with a practical tool for early detection and management of MLTD. By automating the identification of technical debt through code comments, this method not only enhances the maintainability of ML projects but also enriches the practices surrounding documentation and proactive debt management in the field of machine learning.
KW - machine learning
KW - quality
KW - technical debt
UR - https://www.scopus.com/pages/publications/105017601394
U2 - 10.1109/IC_ETC65981.2025.11141122
DO - 10.1109/IC_ETC65981.2025.11141122
M3 - Conference contribution
AN - SCOPUS:105017601394
T3 - Proceedings - 2025 IEEE International Conference on Emerging Technologies and Computing, IC_ETC 2025
BT - Proceedings - 2025 IEEE International Conference on Emerging Technologies and Computing, IC_ETC 2025
T2 - 2025 IEEE International Conference on Emerging Technologies and Computing, IC_ETC 2025
Y2 - 23 June 2025 through 26 June 2025
ER -