TY - GEN
T1 - Addressing Overfitting in an Imbalanced Dataset for MS Progression Prediction
AU - Pilehvari, Shima
AU - Peng, Wei
AU - Morgan, Yasser
AU - Sahraian, Mohammad Ali
AU - Eskandarieh, Sharareh
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - Overfitting is a common problem during model training, particularly for binary medical datasets with class imbalance. This research specifically addresses this issue in predicting Multiple Sclerosis (MS) progression, with the primary goal of improving model accuracy and reliability. By investigating various data resampling techniques, ensemble methods, feature extraction, and model regularization, the study thoroughly evaluates the effectiveness of these strategies in enhancing stability and performance for highly imbalanced datasets. Compared to prior studies, this research advances existing approaches by integrating Kernel Principal Component Analysis (KPCA), moderate under-sampling, Synthetic Minority Oversampling Technique (SMOTE), and post-processing techniques, including Youden’s J Statistic and manual threshold adjustments. This comprehensive strategy significantly reduced overfitting while improving the generalization of models, particularly the Multilayer Perceptron (MLP), which achieved an Area Under the Curve (AUC) of 0.98—outperforming previous models in similar applications. These findings establish important best practices for developing robust prognostic models for MS progression and underscore the importance of tailored solutions in complex medical prediction tasks.
KW - Feature extraction
KW - Imbalanced data
KW - Multiple sclerosis (MS)
KW - Overfitting
KW - Post-processing techniques
KW - Resampling techniques
UR - https://www.scopus.com/pages/publications/105020389411
U2 - 10.1007/978-981-96-6938-7_39
DO - 10.1007/978-981-96-6938-7_39
M3 - Conference contribution
AN - SCOPUS:105020389411
SN - 9789819669370
T3 - Lecture Notes in Networks and Systems
SP - 467
EP - 481
BT - Proceedings of 10th International Congress on Information and Communication Technology, ICICT 2025
A2 - Yang, Xin-She
A2 - Sherratt, R. Simon
A2 - Dey, Nilanjan
A2 - Joshi, Amit
T2 - 10th International Congress on Information and Communication Technology, ICICT 2025
Y2 - 18 February 2025 through 21 February 2025
ER -