Research Article

Investigating the Effect of Class Balancing Methods on the Performance of Machine Learning Techniques: Credit Risk Application

Year 2024, Volume: 5 Issue: 1, 55 - 70, 04.07.2024
https://doi.org/10.56203/iyd.1436742

Abstract

Credit risk arises when customers fail to meet their obligations on bank loans by the end of the agreed term. Technological advances have made machine learning methods usable across many sectors. Integrated into banks' creditworthiness assessment processes, these methods aim to make it easier to identify customers at risk. To support the most appropriate evaluation in the lending process, this study balances imbalanced data sets with resampling techniques that address the class imbalance problem and investigates the effect on machine learning performance. The German, Australian and HMEQ credit data sets were used in the implementation. Risky customers were detected with different machine learning classification methods such as Logistic Regression (LR), K-Nearest Neighbor (KNN), Naive Bayes (NB), Support Vector Machines (SVM), Multilayer Perceptron (MLP), Decision Trees (DT), Random Forests (RF), Gradient Boosting Decision Trees (GBDT), Extremely Randomized Trees, and Hard and Soft Voting. Class imbalance was addressed with resampling and hybrid techniques such as Random Oversampling (ROS), Random Undersampling (RUS), Balanced Bagging Classifier (BBC), SMOTE-Tomek Links and SMOTE-ENN. The performance on the three data sets was examined under four different scenarios. The hybrid method, which combines oversampling and undersampling to balance the classes, showed the best classification performance across the machine learning techniques.
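
As an illustration of the two simplest resampling techniques named in the abstract, the sketch below implements Random Oversampling (ROS) and Random Undersampling (RUS) in plain Python. It is a minimal sketch and not the authors' implementation: the function names, the toy data and the fixed seed are assumptions made for this example, and the study's own tooling and settings are not stated in the abstract.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """ROS: duplicate randomly chosen rows of the smaller classes
    until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):  # sampling without replacement
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 8 "good" (0) and 2 "bad" (1) applicants.
rows = [[i, i * 2] for i in range(10)]
labels = [0] * 8 + [1] * 2

ros_rows, ros_labels = random_oversample(rows, labels)
rus_rows, rus_labels = random_undersample(rows, labels)
print(Counter(ros_labels))  # class counts after oversampling
print(Counter(rus_labels))  # class counts after undersampling
```

ROS keeps every original row but repeats minority rows, while RUS discards majority rows. Hybrid methods such as SMOTE-Tomek Links and SMOTE-ENN pair synthetic oversampling with a cleaning step that removes samples. As a general practice, resampling is usually applied to the training portion only, so that evaluation data keeps its original class ratio.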


Sınıf Dengeleme Yöntemlerinin Makine Öğrenmesi Tekniklerinin Performansları Üzerindeki Etkilerinin Araştırılması: Kredi Riski Uygulaması


Abstract (translated from the Turkish)

Credit risk arises when loans extended by banks to customers are not repaid in line with their obligations by the end of the agreed term. Technological developments make it possible to use machine learning methods in various sectors. Through a system adapted to banks' creditworthiness processes, these methods aim to make it easier to identify customers at risk. To this end, so that the most appropriate evaluation can be made in banks' lending processes, the data sets were balanced with resampling techniques to eliminate the class imbalance problem found in imbalanced data sets, and the effects on machine learning were investigated. The German, Australian and HMEQ credit data sets were used in the application. Different machine learning techniques, namely Logistic Regression (LR), K-Nearest Neighbor (KNN), Naive Bayes (NB), Support Vector Machines (SVM), Multilayer Perceptron (MLP), Decision Trees (DT), Random Forests (RF), Gradient Boosting Decision Trees (GBDT), Extremely Randomized Trees, and Hard and Soft Voting, were used to identify risky customers. The classes were balanced with resampling and hybrid techniques such as Random Oversampling (ROS), Random Undersampling (RUS), Balanced Bagging Classifier (BBC), SMOTE-Tomek Links and SMOTE-ENN. In this context, the performance on three different data sets was examined across four different scenarios. The study found that, for the class balancing problem, the hybrid method combining undersampling and oversampling showed the best classification performance among the machine learning techniques.


Details

Primary Language: English
Subjects: Operation
Journal Section: Research Articles
Authors:
Migraç Enes Furkan Milli (ORCID 0000-0003-2516-7723)
Serkan Aras (ORCID 0000-0002-6808-3979)
İpek Deveci Kocakoç (ORCID 0000-0001-9155-8269)

Early Pub Date: June 27, 2024
Publication Date: July 4, 2024
Submission Date: February 13, 2024
Acceptance Date: June 27, 2024
Published in Issue: Year 2024, Volume: 5, Issue: 1

Cite

APA Milli, M. E. F., Aras, S., & Deveci Kocakoç, İ. (2024). Investigating the Effect of Class Balancing Methods on the Performance of Machine Learning Techniques: Credit Risk Application. İzmir Yönetim Dergisi, 5(1), 55-70. https://doi.org/10.56203/iyd.1436742
