Research Article

Comparison of Data Mining Algorithms Performances in Case of Multicollinearity

Year 2024, Volume 6, Issue 1, 40-67, 30.06.2024
https://doi.org/10.51541/nicel.1371834

Abstract

With advances in computer technology, research using data mining algorithms has increased. In studies built on classification algorithms, degraded data quality plays a significant role in algorithm performance. This study investigates how multicollinearity, one of the factors that compromise data quality, affects the performance of classification algorithms. To detect the presence of multicollinearity, correlation graphs of the datasets were examined, and the degree of multicollinearity was then determined with the condition index. The classification algorithms Naive Bayes (NB), Logistic Regression (LR), k-Nearest Neighbors (kNN), Support Vector Machines (SVM), and Extreme Gradient Boosting (XGBoost) were applied in the analyses. A simulation study and analyses of real datasets were carried out to assess the performance of these methods, and the results are presented in tables. According to the results, the XGBoost algorithm shows a notable performance difference from the other algorithms in terms of accuracy and F-measure when multicollinearity is present in datasets with large sample sizes, whereas Naive Bayes was the algorithm most adversely affected by multicollinearity, showing diminished performance.
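To make the evaluation pipeline concrete, the sketch below outlines how the degree of multicollinearity can be measured with the condition index and how the five classifiers can be compared on accuracy and F-measure. It is only an illustrative outline under assumed settings (a synthetic dataset from scikit-learn's make_classification, arbitrary hyperparameters, and a 70/30 train-test split), not the authors' code or simulation design; a condition index above roughly 30 is conventionally read as a sign of strong multicollinearity.

```python
# Illustrative sketch only -- not the authors' original code or simulation design.
# Assumes scikit-learn and xgboost are installed; data and hyperparameters are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier


def condition_index(X):
    """Largest condition index of the standardized predictors (values > 30 suggest strong multicollinearity)."""
    Xs = StandardScaler().fit_transform(X)
    eigvals = np.linalg.eigvalsh(np.corrcoef(Xs, rowvar=False))
    eigvals = eigvals[eigvals > 1e-12]          # guard against numerically zero eigenvalues
    return np.sqrt(eigvals.max() / eigvals.min())


# Synthetic data in which redundant predictors induce multicollinearity.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=3,
                           n_redundant=6, random_state=42)
print(f"Condition index: {condition_index(X):.1f}")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
models = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "XGBoost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(f"{name:8s} accuracy={accuracy_score(y_te, pred):.3f} "
          f"F1={f1_score(y_te, pred):.3f}")
```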

References

  • Alin, A. (2010), Multicollinearity, Wiley Interdisciplinary Reviews: Computational Statistics, 2(3), 370–374.
  • Alpar, R. (2013), Çok değişkenli istatistiksel yöntemler, Detay Yayıncılık: Ankara, Türkiye.
  • Asselman, A., Khaldi, M. and Aammou, S. (2021), Enhancing the prediction of student performance based on the machine learning xgboost algorithm, Interactive Learning Environments, 1–20.
  • Batista, G. E. A. P. A. and Monard, M. C. (2002), A study of k-nearest neighbour as an imputation method. In Abraham, A., Solar, J.R., Köppen, M. (Ed.), Frontiers in artificial intelligence and applications, 87, 251–260, IOS Press.
  • Blommaert, A., Hens, N. and Beutels, P. (2014), Data mining for longitudinal data under multicollinearity and time dependence using penalized generalized estimating equations, Computational Statistics & Data Analysis, 71(0), 667–680.
  • Burges, C. J. (1998), A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), 121–167.
  • Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L. and Lopez, A. (2020). A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing, 408, 189–215.
  • Chan, J.-L., Leow, S., Bea, K., Cheng, W., Phoong, S., Hong, Z.-W. and Chen, Y. L. (2022), Mitigating the multicollinearity problem and its machine learning approach: A review, Mathematics, 10(8), 1283.
  • Chen, T. and Guestrin, C. (2016), XGBoost: A scalable tree boosting system, KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA.
  • Cortes, C. and Vapnik, V. N. (1995), Support vector networks, Machine Learning, 20, 273–297.
  • Cristianini, N. and Shawe-Taylor, J. (2000), An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press: Cambridge, UK.
  • Davidson, I. and Tayi, G. (2009), Data preparation using data quality matrices for classification mining, European Journal of Operational Research, 197(2), 764-772.
  • Demir, Y. (2020), Çoklu doğrusal regresyon ve bazı cezalı tahmin yöntemlerinin incelenmesi. In S. Öztürk (Ed.), Sosyal ve beşeri bilimlerde teori ve araştırmalar II, 2, 261-276, Gece Akademi: Ankara.
  • Derraz, R., Melissa Muharam, F., Nurulhuda, K., Ahmad Jaafar, N. and Keng Yap, N. (2023), Ensemble and single algorithm models to handle multicollinearity of UAV vegetation indices for predicting rice biomass, Computers and Electronics in Agriculture, 205, 107621.
  • Dong, Z., Li, X., Luan, F., Ding, J. and Zhang, D. (2023), Point and interval prediction of the effective length of hot-rolled plates based on IBES-XGBoost, Measurement, 214(0), 112857.
  • Dumancas, G. and Bello, G. (2015), Comparison of machine-learning techniques for handling multicollinearity in big data analytics and high performance data mining, The International Conference for High Performance Computing, Networking, Storage, and Analysis, Texas, USA.
  • Garg, A. and Tai, K. (2013), Comparison of statistical and machine learning methods in modelling of data with multicollinearity, International Journal of Modelling, Identification and Control, 18(4), 295–312.
  • Georganos, S., Grippa, T., Vanhuysse, S., Lennert, M., Shimoni, M. and Wolff, E. (2018), Very high resolution object-based land use–land cover urban classification using extreme gradient boosting, IEEE Geoscience and Remote Sensing Letters, 15(4), 607-611.
  • Han, J., Kamber, M. and Pei, J. (2012), Data mining concepts and techniques (Third Edition). Morgan Kaufman Publishers: Massachusetts, USA.
  • Harrington, P. (2012), Machine learning in action, Manning Publications: New York, USA.
  • Hosmer, D. W., Lemeshow, S. and Sturdivant, R. X. (2013), Applied logistic regression (Third Edition). John Wiley & Sons, Inc: New Jersey, USA.
  • Kartal, E. and Balaban, M. E. (2019), Destek vektör makineleri: teori ve R dili ile bir uygulama. In M. E. Balaban, E. Kartal (Eds.), Veri madenciliği ve makine öğrenmesi temel kavramlar, algoritmalar, uygulamalar (207-241), Çağlayan Kitabevi: İstanbul.
  • Lewis, N. D. (2017), Machine learning made easy with R: An intuitive step by step blueprint for beginners, CreateSpace Independent Publishing Platform: Carolina, USA.
  • Mason, C. H. and Perreault, W. D. (1991), Collinearity, power, and interpretation of multiple regression analysis, Journal of Marketing Research, 28(3), 268–280.
  • McNamara, J. M., Green, R. F. and Olsson, O. (2006), Bayes' theorem and its applications in animal behaviour, Oikos, 112(2), 243–251.
  • McNamara, M. E., Zisser, M., Beevers, C. G. and Shumake, J. (2022), Not just “big” data: Importance of sample size, measurement error, and uninformative predictors for developing prognostic models for digital interventions, Behaviour Research and Therapy, 153(0), 1-12.
  • Mucherino, A., Papajorgji, P. J. and Pardalos, P. M. (2009), Data mining in agriculture, Springer: Dordrecht, the Netherlands.
  • Mulla, G. A. A., Demir, Y. and Hassan, M. (2021), Combination of PCA with SMOTE Oversampling for Classification of High-Dimensional Imbalanced Data, Bitlis Eren Üniversitesi Fen Bilimleri Dergisi, 10(3), 858–869.
  • Obite, C. P., Olewuezi, N. P., Ugwuanyim, G. U. and Bartholomew, D. C. (2020), Multicollinearity effect in regression analysis: A feed forward artificial neural network approach, Asian Journal of Probability and Statistics, 6(1), 22-33.
  • Öz, E. (2019), Destek vektör makineleri. In S. Alp, E. Öz (Ed.), Makine öğreniminde sınıflandırma yöntemleri ve R uygulamaları (67-189), Nobel Akademik Yayıncılık: Ankara.
  • Rahman, M. M., Ghasemi, Y., Suley, E., Zhou, Y., Wang, S. and Rogers, J. (2021), Machine learning based computer aided diagnosis of breast cancer utilizing anthropometric and clinical features, IRBM, 42(4), 215-226.
  • Roelofs, R., Shankar, V., Recht, B., Fridovich-Keil, S., Hardt, M., Miller, J. and Schmidt, L. (2019), A meta-analysis of overfitting in machine learning. Advances in Neural Information Processing Systems, 32.
  • Senawi, A., Wei, H.-L. and Billings, S. A. (2017), A new maximum relevance-minimum multicollinearity (MRmMC) method for feature selection and ranking, Pattern Recognition, 67, 47-61.
  • Silahtaroğlu, G. (2013), Veri Madenciliği Kavram ve Algoritmaları, Papatya Yayınevi: İstanbul.
  • Singh, R., Biswas, M. and Pal, M. (2022), Cloud detection using Sentinel 2 imageries: A comparison of XGBoost, RF, SVM, and CNN algorithms, Geocarto International, 0(0), 1–32.
  • Stoean, C. and Stoean, R. (2014), Evolutionary support vector machines and their application for classification, Springer International Publishing: New York, USA.
  • Uğuz, S. (2019), Makine öğrenmesi teorik yönleri ve python uygulamaları (1. Basım). Nobel Akademik Yayıncılık: Ankara.
  • Urooj, B., Shah, M. A., Maple, C., Abbasi, M. K. and Riasat, S. (2022), Malware detection: A framework for reverse engineered android applications through machine learning algorithms, IEEE Access, 10(6), 89031-89050.
  • Yan, Z., Chen, H., Dong, X., Zhou, K. and Xu, Z. (2022), Research on prediction of multi-class theft crimes by an optimized decomposition and fusion method based on XGBoost, Expert Systems with Applications, 207, 117943.
  • Ying, X. (2019), An overview of overfitting and its solutions, Journal of Physics: Conference Series, 1168, 022022, IOP Publishing.
  • Zhang, X., Liu, S. and Zheng, X. (2021), Stock Price Movement Prediction Based on a Deep Factorization Machine and the Attention Mechanism, Mathematics, 9(8), 800.
  • Zhu, J., Ge, Z., Song, Z. and Gao, F. (2018), Review and big data perspectives on robust data mining approaches for industrial process modeling with outliers and missing data, Annual Reviews in Control, 46(1), 107–133.


Details

Primary Language: Turkish
Subjects: Statistical Data Science, Applied Statistics
Journal Section: Articles
Authors

Saygın Diler 0000-0002-9056-412X

Yıldırım Demir 0000-0002-6350-8122

Publication Date: June 30, 2024
Published in Issue: Year 2024, Volume 6, Issue 1

Cite

APA Diler, S., & Demir, Y. (2024). Çoklu Doğrusal Bağlantı Olması Durumunda Veri Madenciliği Algoritmaları Performanslarının Karşılaştırılması. Nicel Bilimler Dergisi, 6(1), 40-67. https://doi.org/10.51541/nicel.1371834