DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI

Abdullah Dal; İbrahim Halil Gümüş; Serkan Güldal; Mustafa Yavaş

doi:10.54365/adyumbd.940539

Research Article

Year 2021, Volume: 8 Issue: 15, 343 - 352, 31.12.2021

Cited By: 3

A NEW RESAMPLING APPROACH BASED ON WEIGHTED GEOMETRIC MEAN FOR UNBALANCED DATA

Abstract

Machine learning methods have considerably improved data classification in recent years. As technology advances, the volume of data on the internet and in other environments grows rapidly, and with it the amount of imbalanced and unclassified data. In a two-class problem, imbalance means that one class has fewer samples than the other. Many datasets, particularly in the medical field, are imbalanced, and this degrades the performance of classification algorithms. Many studies have addressed the problem at the data level and at the algorithm level; the resampling methods among them either undersample or oversample the data. In this study, the minority class is oversampled with synthetic samples until the dataset is balanced. First, the nearest neighbors of every minority-class sample are found among the minority-class samples using the Euclidean distance metric. Then, based on these neighbors, the desired number of new synthetic samples is generated between samples using the Weighted Geometric Mean. The original and balanced datasets are classified with the Random Forest algorithm and the results are compared. After resampling, overall accuracy rose from 0.751 to 0.797 and the minority-class F-measure rose from 0.599 to 0.805. The proposed resampling approach therefore improved classification performance relative to the original dataset.
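The resampling step described in the abstract can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function name `wgm_oversample`, the neighbor count `k`, the uniformly random weight `w`, and the requirement that features be non-negative are all assumptions introduced here, since the abstract does not specify them.

```python
import numpy as np


def wgm_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (hypothetical sketch).

    Each synthetic sample is the weighted geometric mean a**(1-w) * b**w
    of a minority sample a and one of its k nearest minority neighbors b.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    if np.any(X_min < 0):
        # Assumption: the geometric mean is only used on non-negative features.
        raise ValueError("features must be non-negative")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its neighbors for every new sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Assumption: the weight is drawn uniformly from [0, 1).
    w = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return np.power(X_min[base], 1.0 - w) * np.power(X_min[picked], w)


if __name__ == "__main__":
    X = np.array([[1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [8.0, 16.0]])
    print(wgm_oversample(X, n_new=3, k=2))
```

Because each feature of a synthetic sample is a weighted geometric mean of the corresponding features of two existing samples, it always lies between those two values, so the new points stay inside the range already covered by the minority class. To balance a dataset, `n_new` would be set to the difference between the majority and minority class sizes.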

Keywords

Resampling, Weighted Geometric Mean, Unbalanced Data

Details

Primary Language	Turkish
Subjects	Engineering
Journal Section	Articles
Authors	Abdullah Dal (ORCID 0000-0001-9306-6276); İbrahim Halil Gümüş (ORCID 0000-0002-3071-1159); Serkan Güldal (ORCID 0000-0002-4247-0786); Mustafa Yavaş (ORCID 0000-0002-9111-9095)
Publication Date	December 31, 2021
Submission Date	May 26, 2021
Published in Issue	Year 2021 Volume: 8 Issue: 15

Cite

APA	Dal, A., Gümüş, İ. H., Güldal, S., Yavaş, M. (2021). DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi, 8(15), 343-352. https://doi.org/10.54365/adyumbd.940539
AMA	Dal A, Gümüş İH, Güldal S, Yavaş M. DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi. December 2021;8(15):343-352. doi:10.54365/adyumbd.940539
Chicago	Dal, Abdullah, İbrahim Halil Gümüş, Serkan Güldal, and Mustafa Yavaş. “DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI”. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi 8, no. 15 (December 2021): 343-52. https://doi.org/10.54365/adyumbd.940539.
EndNote	Dal A, Gümüş İH, Güldal S, Yavaş M (December 1, 2021) DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi 8 15 343–352.
IEEE	A. Dal, İ. H. Gümüş, S. Güldal, and M. Yavaş, “DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI”, Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi, vol. 8, no. 15, pp. 343–352, 2021, doi: 10.54365/adyumbd.940539.
ISNAD	Dal, Abdullah et al. “DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI”. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi 8/15 (December 2021), 343-352. https://doi.org/10.54365/adyumbd.940539.
JAMA	Dal A, Gümüş İH, Güldal S, Yavaş M. DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi. 2021;8:343–352.
MLA	Dal, Abdullah et al. “DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI”. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi, vol. 8, no. 15, 2021, pp. 343-52, doi:10.54365/adyumbd.940539.
Vancouver	Dal A, Gümüş İH, Güldal S, Yavaş M. DENGESİZ VERİLER İÇİN AĞIRLIKLI GEOMETRİK ORTALAMA TABANLI YENİ BİR YENİDEN ÖRNEKLEME YAKLAŞIMI. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi. 2021;8(15):343-52.

Cited By

DİYABET RİSK DURUMUNUN BELİRLENMESİNDE SINIFLANDIRMA ALGORİTMALARININ PERFORMANSLARININ KAPSAMLI BİR ŞEKİLDE KARŞILAŞTIRILMASI

Kahramanmaraş Sütçü İmam Üniversitesi Mühendislik Bilimleri Dergisi

https://doi.org/10.17780/ksujes.1465177

Yazılım Hata Tahmininde Farklı Alt Örnekleme ve Üst Örnekleme Yöntemlerinin Kıyaslanması

Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi

https://doi.org/10.54525/tbbmd.1235547

Tıbbi Verilerde Heinz Ortalamasına Dayalı Yeni Sentetik Veriler Üreterek Veri Kümesini Dengeleme

Afyon Kocatepe University Journal of Sciences and Engineering

https://doi.org/10.35414/akufemubid.1011058