Research Article

The effects of missing data imputation methods with machine learning on classification performance

Year 2023, Volume: 5 Issue: 1, 51 - 71, 30.06.2023
https://doi.org/10.51177/kayusosder.1307226

Makine Öğrenmesi İle Eksik Veri Tamamlama Yöntemlerinin Sınıflandırma Performansına Etkileri

Abstract

Missing values are a common problem in data sets collected for research. The literature offers a range of methods for imputing them. With advances in information technology and data management, the number of such methods has grown, and machine learning methods have also come into use for imputing missing values. This study uses the "Hitters" data set, which is widely used in the literature. Values in the data set were deliberately removed, and the removed values were then imputed using basic missing-value methods such as Listwise Deletion, Last Observation Carried Forward, and Mean Imputation, as well as machine learning methods such as Stochastic Regression, the k-Nearest Neighbors algorithm, the Random Forest algorithm, and the Amelia algorithm. The original, complete data set and the data sets obtained by imputing the missing values with these methods were classified with the Naive Bayes algorithm using the WEKA software package. The classification results were compared on classification time, accuracy, precision, recall, F-measure, and ROC area. The results show that machine learning methods perform well both at imputing missing data and at improving classification performance.
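
As a rough illustration of the three basic methods named in the abstract (listwise deletion, last observation carried forward, and mean imputation) and of four of the reported evaluation criteria, the following minimal Python sketch may help. It is not the authors' implementation: the study classified with WEKA, and the tooling used for imputation is not stated here. All function names, the toy `column` and `rows` data, and the binary-label setup are assumptions made purely for illustration, and missing values are represented as `None`.

```python
from statistics import mean


def listwise_deletion(rows):
    """Drop every row that contains at least one missing value (None)."""
    return [row for row in rows if None not in row]


def locf(column):
    """Last Observation Carried Forward: fill a gap with the previous observed value."""
    filled, last = [], None
    for value in column:
        if value is not None:
            last = value
        filled.append(last)  # remains None if no earlier observation exists
    return filled


def mean_imputation(column):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def binary_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall and F-measure for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical toy data: one numeric column with two gaps, and three (value, label) rows.
column = [4.0, None, 6.0, None, 8.0]
rows = [(4.0, 1), (None, 0), (6.0, 1)]

complete_rows = listwise_deletion(rows)
locf_column = locf(column)
mean_column = mean_imputation(column)
scores = binary_metrics(actual=[1, 1, 0, 0], predicted=[1, 0, 0, 1])
```

In a comparison like the one the abstract describes, each imputed data set would be passed to the same classifier and the resulting scores compared across methods. The sketch covers only the imputation and scoring steps, not the classifier, the ROC area, or the machine learning imputers.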

Details

Primary Language: Turkish
Subjects: Machine Learning (Other), Data Mining and Knowledge Discovery
Journal Section: Research Articles
Authors: Şemsettin Erken (ORCID 0000-0002-8936-5633), Levent Şenyay (ORCID 0000-0001-9484-608X)
Early Pub Date: June 23, 2023
Publication Date: June 30, 2023
Submission Date: May 30, 2023
Published in Issue: Year 2023, Volume: 5, Issue: 1

Cite

APA: Erken, Ş., & Şenyay, L. (2023). Makine Öğrenmesi İle Eksik Veri Tamamlama Yöntemlerinin Sınıflandırma Performansına Etkileri. Kayseri Üniversitesi Sosyal Bilimler Dergisi, 5(1), 51-71. https://doi.org/10.51177/kayusosder.1307226

All legal responsibilities of the articles in the journal belong to the authors.
Kayseri University Journal of Social Sciences is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. The content on the site can be shared, copied, reproduced and distributed non-commercially, provided that it is published under the terms of the license, but its content cannot be changed.