WHEN DATA AUGMENTATION HURTS: A SYSTEMATIC EVALUATION OF SMOTE-BASED TECHNIQUES ON MEDICAL DATASETS
Abstract
Data augmentation techniques, particularly the Synthetic Minority Over-sampling Technique (SMOTE) and its variants, are routinely applied to address class imbalance in medical datasets. However, the assumption that augmentation universally improves classification performance remains largely unvalidated. This study presents a systematic evaluation of four SMOTE-based augmentation methods across three medical datasets to determine when these techniques help or harm model performance. The study evaluated SMOTE, ADASYN, BorderlineSMOTE, and SVM-SMOTE on breast cancer diagnosis, heart disease prediction, and diabetes detection datasets, representing varying levels of class imbalance (ratios of 1.17 to 2.02) and baseline performance (F1 scores of 0.667 to 0.966). Random Forest classifiers were employed in both standard and regularized configurations to ensure robust findings. Each augmentation method was evaluated over 10 independent runs with statistical significance testing and effect size analysis. Results revealed that augmentation significantly degraded performance on the high-performing Breast Cancer dataset, with all methods showing statistically significant decreases (p < 0.05) and F1 scores dropping by up to 2.2%. Conversely, the Pima Diabetes dataset, characterized by lower baseline performance and higher imbalance, showed improvements of up to 4.76% with SVM-SMOTE. The Heart Disease dataset exhibited mixed results, with only ADASYN achieving a meaningful improvement. Analysis uncovered a strong negative correlation (r = -0.997) between baseline model performance and augmentation effectiveness, making baseline performance a more reliable predictor of augmentation benefit than the traditional class imbalance ratio.
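The evaluation protocol described above can be sketched with scikit-learn and imbalanced-learn. The snippet below is a minimal illustration, not the authors' code: the stand-in dataset loader, the Random Forest settings, the 80/20 stratified split, the paired t-test, and the Cohen's d formulation are assumptions replacing the paper's exact protocol.

import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE

# Stand-in dataset; the study used breast cancer, heart disease, and Pima diabetes data.
X, y = load_breast_cancer(return_X_y=True)

samplers = {
    "baseline": None,
    "SMOTE": SMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "BorderlineSMOTE": BorderlineSMOTE(random_state=0),
    "SVM-SMOTE": SVMSMOTE(random_state=0),
}

scores = {name: [] for name in samplers}
for run in range(10):  # 10 independent runs per method
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=run
    )
    for name, sampler in samplers.items():
        # Oversample the training split only; the test split stays untouched.
        X_fit, y_fit = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
        clf = RandomForestClassifier(n_estimators=100, random_state=run)
        clf.fit(X_fit, y_fit)
        scores[name].append(f1_score(y_te, clf.predict(X_te)))

baseline = np.array(scores["baseline"])
for name, f1s in scores.items():
    if name == "baseline":
        continue
    diff = np.array(f1s) - baseline
    t_stat, p_val = stats.ttest_rel(f1s, baseline)   # paired t-test across the 10 runs
    cohens_d = diff.mean() / diff.std(ddof=1)        # paired-samples effect size (d_z)
    print(f"{name}: dF1 = {diff.mean():+.4f}, p = {p_val:.4f}, d = {cohens_d:+.2f}")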
The study establishes an evidence-based decision framework: augmentation should be avoided when the baseline F1 exceeds 0.95 or the imbalance ratio falls below 1.5, considered when the baseline F1 is below 0.70 and the imbalance ratio is above 1.8, and carefully validated in intermediate cases. These findings challenge the current practice of applying augmentation routinely and demonstrate that synthetic sample generation can blur decision boundaries in well-separated feature spaces. The research provides practitioners with validated guidelines for determining when augmentation techniques genuinely improve medical classifiers and when they cause harm, ultimately supporting more effective development of clinical decision support systems.
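As a concrete illustration, the decision framework reduces to two thresholds on quantities measurable before any augmentation is applied. The function below is a minimal sketch; the function name, the "validate" label for intermediate cases, and the example value pairings (taken from the ranges reported above, not from a stated per-dataset pairing) are illustrative.

def augmentation_recommendation(baseline_f1: float, imbalance_ratio: float) -> str:
    """Map baseline F1 and class-imbalance ratio (majority/minority) to a recommendation."""
    if baseline_f1 > 0.95 or imbalance_ratio < 1.5:
        return "avoid"      # well-separated or near-balanced data: augmentation likely harms
    if baseline_f1 < 0.70 and imbalance_ratio > 1.8:
        return "consider"   # weak baseline on imbalanced data: augmentation may help
    return "validate"       # intermediate case: compare empirically against the un-augmented baseline


# Example values drawn from the ranges reported in the abstract (illustrative pairing).
print(augmentation_recommendation(0.966, 1.17))  # -> "avoid"
print(augmentation_recommendation(0.667, 2.02))  # -> "consider"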