Diabetes is a prevalent non-communicable disease affecting many people globally. Common risk factors include obesity, age, lack of exercise, lifestyle, genetic factors, high blood pressure, and poor diet. Early identification of the condition can help prevent subsequent complications, including heart attacks, lower limb amputations, nerve damage, and blindness. Over the years, data mining and machine learning have become popular and successful methods for identifying numerous diseases, including diabetes, from clinical data. This study focuses on the principles and processes of the Naïve Bayes, Support Vector Machines, Logistic Regression, Decision Tree, and Random Forest algorithms for diabetes prediction, using the implementations built into Scikit-learn for the experiments. Furthermore, we combine all five machine learning models into a single stacked ensemble model. Data preprocessing techniques such as scaling, missing data removal, dimensionality reduction, and balancing of the target class were applied to the Jos Urban Diabetes dataset used for this study. A comparison of the algorithms' performance across various evaluation metrics demonstrates that the Support Vector Machines algorithm outperforms all others in terms of Accuracy, Precision, Sensitivity, and Matthews Correlation Coefficient, with scores of 96.11%, 91.61%, 85.67%, and 82.59%, respectively, under 10-fold cross-validation. The stacked ensemble model achieved the best Area Under the Receiver Operating Characteristic Curve score, 98.47%, under 10-fold cross-validation.
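
The stacking and evaluation workflow summarized above can be illustrated with a minimal Scikit-learn sketch. It is not the study's code. The Jos Urban Diabetes dataset is not reproduced here, so a synthetic dataset from `make_classification` stands in for it. The choice of Logistic Regression as the meta-learner, all hyperparameters, and the variable names are assumptions made for illustration. Of the preprocessing steps, only scaling is shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in data, not the Jos Urban Diabetes dataset.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)

# The five base learners named in the abstract (default-like settings, assumed).
base_models = [
    ("nb", GaussianNB()),
    ("svm", SVC(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# ASSUMPTION: Logistic Regression as the meta-learner; the abstract does not name one.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Scaling sits inside the pipeline so it is refit on each training fold only.
model = make_pipeline(StandardScaler(), stack)

# The metrics reported in the abstract; sensitivity is recall on the positive class.
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "sensitivity": "recall",
    "mcc": make_scorer(matthews_corrcoef),
    "roc_auc": "roc_auc",
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Average each metric over the 10 folds.
mean_scores = {name: results[f"test_{name}"].mean() for name in scoring}
for name, value in mean_scores.items():
    print(f"{name}: {value:.4f}")
```

Replacing `stack` with any single base learner in `make_pipeline` gives that model's scores under the same folds, which is how a per-algorithm comparison of this kind could be set up. Scores on the synthetic data are not comparable to those reported for the real dataset.
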
Primary Language | English |
---|---|
Subjects | Machine Learning (Other) |
Journal Section | Information and Computing Sciences |
Authors | |
Publication Date | September 30, 2024 |
Submission Date | August 12, 2024 |
Acceptance Date | September 16, 2024 |
Published in Issue | Year 2024 |