Research Article

Effect of different encoding techniques on the mushroom classification performance of KNN algorithm

Year 2025, Volume: 14 Issue: 1, 263 - 270, 15.01.2025
https://doi.org/10.28948/ngumuh.1515387

Abstract

This study investigates how different encoding techniques affect the performance of the K-Nearest Neighbors (KNN) algorithm in classifying mushrooms as poisonous or edible. Various encoding techniques, such as label encoding, one-hot encoding, frequency encoding, hash encoding, and target encoding, were used to convert the categorical features of a predominantly categorical dataset into numerical data. Model performance was evaluated using metrics such as accuracy, precision, recall, and F1-score. The results show that frequency encoding performed best at k=1, while target encoding performed worst at k=7. The findings offer insight into how categorical data transformation affects the KNN model and how it can be used to achieve more accurate classification results.
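As a rough illustration of the pipeline the abstract describes, the sketch below encodes categorical features, classifies with KNN, and scores the predictions. It is a minimal standard-library sketch, not the study's implementation: the toy rows, the feature choice (cap colour and odour), the helper names, the Euclidean distance, and the handling of unseen categories are all assumptions made for illustration. Only label and frequency encoding are shown.

```python
from collections import Counter
import math

# Hypothetical toy rows (cap colour, odour) labelled "p" (poisonous) or "e" (edible).
# These values are invented for illustration; they are not the study's data.
train_X = [("brown", "foul"), ("brown", "none"), ("white", "none"),
           ("red", "foul"), ("white", "almond"), ("brown", "foul")]
train_y = ["p", "e", "e", "p", "e", "p"]

def fit_label_encoder(rows):
    """Per column: map each category to an integer index (sorted order)."""
    return [{c: i for i, c in enumerate(sorted(set(col)))} for col in zip(*rows)]

def fit_frequency_encoder(rows):
    """Per column: map each category to its relative frequency in the training rows."""
    n = len(rows)
    return [{c: k / n for c, k in Counter(col).items()} for col in zip(*rows)]

def transform(rows, maps, unseen=0.0):
    """Apply fitted per-column mappings; unseen categories get `unseen` (an assumption)."""
    return [[m.get(v, unseen) for v, m in zip(row, maps)] for row in rows]

def knn_predict(X_train, y_train, x, k=1):
    """Plain KNN: Euclidean distance, majority vote among the k nearest training points."""
    nearest = sorted(zip(X_train, y_train), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def scores(y_true, y_pred, positive="p"):
    """Accuracy, precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Fit the encoders on the training rows only, then reuse the mappings for test rows.
label_maps = fit_label_encoder(train_X)
freq_maps = fit_frequency_encoder(train_X)
X_freq = transform(train_X, freq_maps)

test_X = [("brown", "foul"), ("white", "almond"), ("white", "none")]
test_y = ["p", "e", "e"]
preds = [knn_predict(X_freq, train_y, x, k=1) for x in transform(test_X, freq_maps)]
print(preds, scores(test_y, preds))
```

Swapping `freq_maps` for `label_maps` (with a sentinel such as `unseen=-1`) changes only the numeric representation that the distance computation sees, which is the kind of comparison the study reports. Because KNN relies entirely on distances, categories that happen to receive nearby codes are treated as similar, regardless of whether they are related.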


Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi



Details

Primary Language: Turkish
Subjects: Reinforcement Learning, Satisfiability and Optimisation, Artificial Intelligence (Other)
Journal Section: Research Articles
Authors: Kadir İleri (ORCID 0000-0002-5041-6165)
Early Pub Date: December 25, 2024
Publication Date: January 15, 2025
Submission Date: July 12, 2024
Acceptance Date: December 16, 2024
Published in Issue: Year 2025, Volume: 14, Issue: 1

Cite

APA İleri, K. (2025). Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, 14(1), 263-270. https://doi.org/10.28948/ngumuh.1515387
