TY - GEN
T1 - LeakageDetector
T2 - 32nd IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
AU - Alomar, Eman Abdullah
AU - Demario, Catherine
AU - Shagawat, Roger
AU - Kreiser, Brandon
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Code quality is of paramount importance in all types of software development settings. Our work seeks to enable Machine Learning (ML) engineers to write better code by helping them find and fix instances of Data Leakage in their models. Data Leakage often results from bad practices in writing ML code. As a result, the model effectively 'memorizes' the data on which it trains, leading to an overly optimistic estimate of model performance and an inability to make generalized predictions. ML developers must carefully separate their data into training, evaluation, and test sets to avoid introducing Data Leakage into their code. Training data should be used to train the model, evaluation data should be used to repeatedly confirm a model's accuracy, and test data should be used only once to determine the accuracy of a production-ready model. In this paper, we develop LeakageDetector, a Python plugin for the PyCharm IDE that identifies instances of Data Leakage in ML code and provides suggestions on how to remove the leakage. The plugin and its source code are publicly available on GitHub at https://github.com/SE4AIResearch/DataLeakage_Fall2023. The demonstration video can be found on YouTube: https://youtu.be/yXj3wihSaIU.
KW - data leakage
KW - machine learning
KW - quality
UR - http://www.scopus.com/inward/record.url?scp=105007306199&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105007306199&partnerID=8YFLogxK
U2 - 10.1109/SANER64311.2025.00089
DO - 10.1109/SANER64311.2025.00089
M3 - Conference contribution
AN - SCOPUS:105007306199
T3 - Proceedings - 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
SP - 844
EP - 849
BT - Proceedings - 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
Y2 - 4 March 2025 through 7 March 2025
ER -