Effect of Random Undersampling, Oversampling, and SMOTE on the Performance of Cardiovascular Disease Prediction Models
DOI:
https://doi.org/10.20956/j.v21i1.35552
Keywords:
Cardiovascular Disease, Machine Learning, Resampling Techniques
Abstract
Cardiovascular disease (CVD), commonly known as heart disease, is a leading cause of mortality globally, prompting extensive research into predictive models that assess individual risk and support the planning of preventive measures. Machine learning approaches such as Random Forest, Support Vector Machine (SVM), and LASSO Logistic Regression have shown promise. Recent studies have indicated that traditional resampling methods like Random Oversampling, Random Undersampling, and SMOTE may not significantly improve model discrimination. This study evaluates the impact of these techniques on the performance of CVD prediction models, using data from the UCI Machine Learning Heart Disease database. LASSO Logistic Regression, Random Forest, and SVM are each combined with Random Oversampling, Random Undersampling, and SMOTE in order to improve understanding of model performance under class imbalance in the dataset. The results show that SMOTE significantly enhances the performance of CVD prediction models. Specifically, when combined with the Random Forest algorithm, SMOTE achieves the best performance in terms of accuracy, sensitivity, and specificity. This highlights the importance of selecting appropriate resampling techniques to handle class imbalance in datasets. Consequently, this research contributes to refining CVD prediction strategies and provides new insights into improving prediction accuracy in imbalanced medical data.
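To make the three resampling techniques named in the abstract concrete, the following is a minimal pure-Python sketch of how each one rebalances a labelled dataset. It is not the authors' code: the function names, the toy data, and the neighbour count `k` are assumptions chosen for illustration, and the SMOTE variant is a simplified version of the interpolation idea (numeric features only, at least two rows per minority class).

```python
import random
from collections import Counter


def split_by_class(X, y):
    """Group feature rows by their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(X, y, rng):
    """Randomly drop rows so every class shrinks to the smallest class size."""
    groups = split_by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, n_min):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_oversample(X, y, rng):
    """Randomly duplicate rows so every class grows to the largest class size."""
    groups = split_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def smote(X, y, rng, k=5):
    """SMOTE-style oversampling: create synthetic rows by interpolating
    between a minority row and one of its k nearest same-class neighbours."""
    groups = split_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    X_out, y_out = list(X), list(y)
    for label, rows in groups.items():
        for _ in range(n_max - len(rows)):
            base = rng.choice(rows)
            # Rank the other rows of this class by squared Euclidean distance.
            others = [r for r in rows if r is not base]
            others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
            neighbour = rng.choice(others[:k])
            gap = rng.random()  # position along the segment base -> neighbour
            X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: 8 majority rows (label 0), 3 minority rows (label 1).
    X = [[float(i), float(i % 3)] for i in range(8)]
    X += [[10.0, 1.0], [11.0, 2.0], [12.0, 1.5]]
    y = [0] * 8 + [1] * 3
    for name, fn in [("undersample", random_undersample),
                     ("oversample", random_oversample),
                     ("smote", smote)]:
        _, y_res = fn(X, y, rng)
        print(name, dict(Counter(y_res)))
```

In a modelling workflow, resampling of this kind is normally applied only to the training split, so that accuracy, sensitivity, and specificity are measured on data with the original class distribution.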
License
Copyright (c) 2024 Jurnal Matematika, Statistika dan Komputasi
This work is licensed under a Creative Commons Attribution 4.0 International License.
Jurnal Matematika, Statistika dan Komputasi is an Open Access journal. All articles are distributed under the terms of the Creative Commons Attribution License, which allows third parties to copy and redistribute the material in any medium or format, and to transform and build upon it, provided the original work is properly cited and its license is stated. This license allows authors and readers to use all articles, data sets, graphics, and appendices in data mining applications, search engines, web sites, blogs, and other platforms, provided appropriate reference is given.