Comparative Performance Analysis of Selected Machine Learning Algorithms and the Stacking Ensemble Method for Prediction of the Type II Diabetes Disease

Nathan Zoakah; Augustine Shey Nsang; Abel Ajibesin; Ayuba Zoakah

doi:10.54287/gujsa.1531997

Research Article

Year 2024, Volume: 11 Issue: 3, 622 - 646, 30.09.2024

Abstract

Diabetes is a prevalent non-communicable disease affecting many people globally. Common risk factors include obesity, age, lack of exercise, lifestyle, genetic factors, high blood pressure, and poor diet. Early identification of the condition can help prevent subsequent complications, including heart attacks, lower limb amputations, nerve damage, and blindness. Over the years, data mining and machine learning have become popular and successful methods of identifying numerous diseases, including diabetes, from clinical data. This study focuses on the principles and processes of the Naïve Bayes, Support Vector Machines, Logistic Regression, Decision Tree, and Random Forest algorithms for diabetes prediction, using the Scikit-learn inbuilt libraries for the experiments. Furthermore, we combine all five machine learning models into a single stacked ensemble model. Data preprocessing techniques such as scaling, missing data removal, dimensionality reduction, and balancing of the target class were performed on the Jos Urban Diabetes dataset used for this study. A comparison of the algorithms' performance across various evaluation metrics demonstrates that the Support Vector Machines algorithm outperforms all others in terms of Accuracy, Precision, Sensitivity, and Matthews Correlation Coefficient, with scores of 96.11%, 91.61%, 85.67%, and 82.59%, respectively, under 10-fold cross-validation. The Stacked Ensemble Method model achieved the best Area Under the Receiver Operating Characteristic Curve score, 98.47%, also under 10-fold cross-validation.
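The workflow summarized in the abstract can be illustrated with a minimal Scikit-learn sketch. It is not the authors' code: the Jos Urban Diabetes dataset is not reproduced here, so synthetic data from make_classification stands in for it, and the hyperparameters, the logistic-regression meta-learner, and the PCA variance threshold are all assumptions chosen for illustration. Missing data removal and class balancing are omitted, and the numbers it prints have no relation to the scores reported above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the dataset used in the study.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    weights=[0.7, 0.3], random_state=0,
)

# The five base algorithms named in the abstract, with illustrative settings.
base_learners = [
    ("nb", GaussianNB()),
    ("svm", SVC(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Stacking: out-of-fold predictions of the base learners feed a meta-learner.
# The logistic-regression meta-learner is an assumption, not taken from the paper.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Scaling and dimensionality reduction sit inside the pipeline so they are
# re-fitted on each training fold and never see the held-out fold.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), stack)

# The five evaluation metrics listed in the abstract.
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "sensitivity": "recall",
    "mcc": make_scorer(matthews_corrcoef),
    "roc_auc": "roc_auc",
}

# 10-fold stratified cross-validation, averaged per metric.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=scoring)
mean_scores = {name: results[f"test_{name}"].mean() for name in scoring}

for name, value in mean_scores.items():
    print(f"{name}: {value:.4f}")
```

Replacing `model` with a pipeline built around a single base learner (for example `SVC`) gives the per-algorithm scores that such a comparison would set against the stacked ensemble.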

Keywords

Machine Learning, Diabetes, Support Vector Machine, Stacked Ensemble Method

There are 31 citations in total.

Details

Primary Language	English
Subjects	Machine Learning (Other)
Journal Section	Information and Computing Sciences
Authors	Nathan Zoakah (ORCID 0009-0000-3873-8471); Augustine Shey Nsang (ORCID 0000-0002-6466-9032); Abel Ajibesin (ORCID 0000-0001-6518-0231); Ayuba Zoakah (ORCID 0000-0003-1856-7753)
Publication Date	September 30, 2024
Submission Date	August 12, 2024
Acceptance Date	September 16, 2024
Published in Issue	Year 2024 Volume: 11 Issue: 3

Cite

APA	Zoakah, N., Shey Nsang, A., Ajibesin, A., & Zoakah, A. (2024). Comparative Performance Analysis of Selected Machine Learning Algorithms and the Stacking Ensemble Method for Prediction of the Type II Diabetes Disease. Gazi University Journal of Science Part A: Engineering and Innovation, 11(3), 622-646. https://doi.org/10.54287/gujsa.1531997
