Machine Learning Based Classification for Spam Detection

Serkan Keskin; Onur Sevli

doi:10.16984/saufenbilder.1264476

Research Article

Year 2024, Volume: 28 Issue: 2, 270 - 282, 30.04.2024

Cited By: 1

Abstract

Electronic messages, i.e. e-mails, are a communication tool frequently used by individuals and organizations. While e-mail is extremely practical, its vulnerabilities must be considered. Spam e-mails are unsolicited messages, typically created to promote a product or service and often sent repeatedly. Classifying incoming e-mails is important both for protecting against malware that can be transmitted via e-mail and for reducing other unwanted consequences. Spam e-mail classification is the process of distinguishing spam from legitimate e-mails; it can be done through various methods such as keyword filtering, machine learning algorithms and image recognition. Its goal is to prevent unwanted and potentially harmful e-mails from reaching the user's inbox. In this study, the Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN) algorithms are used to classify spam e-mails, and their results are compared. Algorithms with different approaches were chosen in order to determine the best solution for the problem. A total of 5558 spam and non-spam e-mails were analyzed, and the performance of the algorithms is reported in terms of accuracy, precision, sensitivity and F1-score. The most successful result was obtained with the RF algorithm, with an accuracy of 98.83%. The study thus achieves high success in classifying spam e-mails with machine learning algorithms, and the experimental results are better than those of similar studies in the literature.
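
The four metrics named in the abstract can be related to one another with a small worked example. The Python sketch below computes accuracy, precision, sensitivity (recall) and F1-score for a binary spam/non-spam task from true and predicted labels. The function name `evaluate`, the label strings "spam" and "ham", and the toy labels are assumptions made for illustration only; they are not the study's code or data.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    sensitivity: float  # also called recall
    f1: float


def evaluate(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, sensitivity and F1 for a binary task."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts, treating `positive` as the spam class.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    # Guard every ratio against a zero denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return BinaryMetrics(accuracy, precision, sensitivity, f1)


# Hypothetical labels for illustration only (not data from the study).
y_true = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam", "ham", "ham"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

In this toy case there are 2 true positives, 1 false positive, 1 false negative and 4 true negatives, so accuracy is 0.75 while precision, sensitivity and F1 are each 2/3. On a data set with many more legitimate e-mails than spam, accuracy can look high even when sensitivity is low, which is one reason to report all four metrics together.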

Keywords

Artificial Intelligence, Email Classification, Machine Learning, Spam Detection

References

[1] E. G. Dada, J. S. Bassi, H. Chiroma, A. O. Adetunmbi, & O. E. Ajibuwa, "Machine learning for email spam filtering: review, approaches and open research problems." Heliyon 5.6, e01802, 2019.
[2] L. Ceci (2022, Nov. 14). Number of e-mail users worldwide [Online]. Available: https://www.statista.com/statistics/255080/number-of-e-mail-users-worldwide/
[3] S. Dixon (2022, Apr. 28). Daily spam volume worldwide [Online]. Available: https://www.statista.com/statistics/1270424/daily-spam-volume-global/
[4] P. Pantel, D. L. Spamcop, "A Spam Classification and Organization Program." Learning for Text Categorization, 2006.
[5] S. Zeadally, E. Adi, Z. Baig, & I. A. Khan, "Harnessing artificial intelligence capabilities to improve cybersecurity." IEEE Access 8, 23817-23837, 2020.
[6] A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti, & M. Alazab, "A comprehensive survey for intelligent spam email detection." IEEE Access 7, 168261-168295, 2019.
[7] T. Dogan, "On Term Weighting for Spam SMS Filtering." Sakarya University Journal of Computer and Information Sciences 3.3, 239-249, 2020.
[8] S. Douzi, F. A. AlShahwan, M. Lemoudden, & B. El Ouahidi, "Hybrid email spam detection model using artificial intelligence." International Journal of Machine Learning and Computing 10.2, 2020.
[9] E. M. Onyema, S. Dalal, C. A. T. Romero, B. Seth, P. Young, & M. A. Wajid, "Design of intrusion detection system based on cyborg intelligence for security of cloud network traffic of smart cities." Journal of Cloud Computing 11.1, 1-20, 2022.
[10] A. Bhowmick, S. M. Hazarika, "E-mail spam filtering: a review of techniques and trends." Advances in Electronics, Communication and Computing: ETAEERE-2016, 583-590, 2018.
[11] D. Abidin, "The Effect of Derived Features on Art Genre Classification with Machine Learning." Sakarya University Journal of Science 25.6, 1275-1286, 2021.
[12] P. Sharma, U. Bhardwaj, "Machine learning based spam e-mail detection." International Journal of Intelligent Engineering and Systems 11.3, 1-10, 2018.
[13] Ö. Şahinaslan, H. Dalyan, E. Şahinaslan, "Naive bayes sınıflandırıcısı kullanılarak youtube verileri üzerinden çok dilli duygu analizi." Bilişim Teknolojileri Dergisi 15.2, 221-229, 2022.
[14] A. Junnarkar, S. Adhikari, J. Fagania, P. Chimurkar, D. Karia, "E-mail spam classification via machine learning and natural language processing." 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV). IEEE, 2021.
[15] Y. S. Bozan, Ö. Çoban, G. T. Özyer, & B. Özyer, "SMS spam filtering based on text classification and expert system." 2015 23rd Signal Processing and Communications Applications Conference (SIU). IEEE, 2015.
[16] A. K. A. Salihi, Spam detection by using word-vector learning algorithm in online social networks. MS thesis. Fen Bilimleri Enstitüsü, 2019.
[17] H. Karamollaoglu, İ. A. Dogru, M. Dorterler, "Detection of Spam E-mails with Machine Learning Methods." 2018 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE, 2018.
[18] M. T. Ma, K. Yamamori, A. Thida, "A comparative approach to Naïve Bayes classifier and support vector machine for email spam classification." 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE). IEEE, 2020.
[19] B. K. Dedeturk, B. Akay, "Spam filtering using a logistic regression model trained by an artificial bee colony algorithm." Applied Soft Computing 91, 106229, 2020.
[20] N. Baktır, A. Yılmaz, "Makine Öğrenmesi Yaklaşımlarının Spam-Mail Sınıflandırma Probleminde Karşılaştırmalı Analizi." Bilişim Teknolojileri Dergisi 15.3, 349-364, 2022.
[21] F. Jánez-Martino, E. Fidalgo, S. González-Martínez, J. Velasco-Mata, "Classification of spam emails through hierarchical clustering and supervised learning." arXiv preprint arXiv:2005.08773, 2020.
[22] R. Mansoor, N. D. Jayasinghe, M. M. A. Muslam, "A comprehensive review on email spam classification using machine learning algorithms." 2021 International Conference on Information Networking (ICOIN). IEEE, 2021.
[23] A. Yıldız, M. Demirci, Kurumsal e-posta sınıflandırma sistemi. Diss. Yüksek Lisans Tezi, Gazi Üniversitesi Fen Bilimleri Enstitüsü, 82, Ankara, 2017.
[24] I. J. Alkaht, B. Al-Khatib, "Filtering spam using several stages neural networks." Int. Rev. Comp. Softw 11.2, 2016.
[25] A. Sharma, A. Suryawanshi, "A novel method for detecting spam email using KNN classification with spearman correlation as distance measure." International Journal of Computer Applications 136.6, 28-35, 2016.
[26] T. Jain, P. Garg, N. Chalil, A. Sinha, V. K. Verma, & R. Gupta, "SMS spam classification using machine learning techniques." 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), 273-279. IEEE, 2022.
[27] S. Gadde, A. Lakshmanarao, & S. Satyanarayana, "SMS spam detection using machine learning and deep learning techniques." 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), Vol. 1, 358-362. IEEE, 2021.
[28] G. A. Reddy, & B. I. Reddy, "Classification of Spam Text using SVM." Journal of University of Shanghai for Science and Technology 23.8, 616-624, 2021.
[29] R. Kumar, K. S. R. Murthy, J. Ramesh Babu, & A. Shaik, "Live Text Analyzer to Detect Unsolicited Messages Using Count Vectorizer." Journal of Engineering Sciences 14.06, 2023.
[30] O. Abayomi-Alli, S. Misra, & A. Abayomi-Alli, "A deep learning method for automatic SMS spam classification: Performance of learning algorithms on indigenous dataset." Concurrency and Computation: Practice and Experience 34.17, e6989, 2022.
[31] "Email Spam Detection 98% Accuracy | Kaggle." https://www.kaggle.com/code/mfaisalqureshi/email-spam-detection-98-accuracy/data (accessed Aug. 21, 2023).
[32] M. Zhou, N. Duan, S. Liu, H. Y. Shum, "Progress in neural NLP: modeling, learning, and reasoning." Engineering 6.3, 275-290, 2020.
[33] I. Yahav, O. Shehory, D. Schwartz, "Comments mining with TF-IDF: the inherent bias and its removal." IEEE Transactions on Knowledge and Data Engineering 31.3, 437-450, 2018.
[34] Y. Altuntaş, A. F. Kocamaz, A. M. Ülkgün, "Determination of Individual Investors' Financial Risk Tolerance by Machine Learning Methods." 2020 28th Signal Processing and Communications Applications Conference (SIU). IEEE, 2020.
[35] R. Gürfidan, M. Ersoy, "Classification of death related to heart failure by machine learning algorithms." Advances in Artificial Intelligence Research 1.1, 13-18, 2021.
[36] S. Şenel, B. Alatli, "Lojistik regresyon analizinin kullanıldığı makaleler üzerine bir inceleme." Journal of Measurement and Evaluation in Education and Psychology 5.1, 35-52, 2014.
[37] A. McCallum, K. Nigam, "A comparison of event models for naive bayes text classification." AAAI-98 workshop on learning for text categorization. Vol. 752. No. 1. 1998.
[38] V. Metsis, I. Androutsopoulos, G. Paliouras, "Spam filtering with naive bayes-which naive bayes?" CEAS. Vol. 17. 2006.
[39] F. M. Avcu, "Az Veri Setli Çalışmalarında Derin Öğrenme Ve Diğer Sınıflandırma Algoritmalarının Karşılaştırılması: Agonist Ve Antagonist Ligand Örneği." İnönü Üniversitesi Sağlık Hizmetleri Meslek Yüksek Okulu Dergisi 10.1, 356-371, 2022.
[40] Ö. Akar, O. Güngör, "Rastgele orman algoritması kullanılarak çok bantlı görüntülerin sınıflandırılması." Jeodezi ve Jeoinformasyon Dergisi 106, 139-146, 2012.
[41] A. Arı, M. E. Berberler, "Yapay sinir ağları ile tahmin ve sınıflandırma problemlerinin çözümü için arayüz tasarımı." Acta Infologica 1.2, 55-73, 2017.
[42] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, A. M. Umar, O. U. Linus, M. U. Kiru, "Comprehensive review of artificial neural network applications to pattern recognition." IEEE Access 7, 158820-158846, 2019.
[43] Z. K. Şentürk, "Artificial neural networks based decision support system for the detection of diabetic retinopathy." Sakarya Üniversitesi Fen Bilimleri Enstitüsü Dergisi 24.2, 424-431, 2020.
[44] N. Nazlı, Analysis of machine learning-based spam filtering techniques. MS thesis. 2018.
[45] B. Kale, Veri madenciliği sınıflandırma algoritmaları ile e-posta önemliliğinin belirlenmesi. MS thesis. Fen Bilimleri Enstitüsü, 2018.
[46] M. Zavvar, M. Rezaei, S. Garavand, "Email spam detection using combination of particle swarm optimization and artificial neural network and support vector machine." International Journal of Modern Education and Computer Science 8.7, 68, 2016.

There are 46 citations in total.

Details

Primary Language	English
Subjects	Artificial Intelligence
Journal Section	Research Articles
Authors	Serkan Keskin 0000-0001-9404-5039 Onur Sevli 0000-0002-8933-8395
Early Pub Date	April 22, 2024
Publication Date	April 30, 2024
Submission Date	March 13, 2023
Acceptance Date	December 8, 2023
Published in Issue	Year 2024 Volume: 28 Issue: 2

Cite

APA	Keskin, S., & Sevli, O. (2024). Machine Learning Based Classification for Spam Detection. Sakarya University Journal of Science, 28(2), 270-282. https://doi.org/10.16984/saufenbilder.1264476
AMA	Keskin S, Sevli O. Machine Learning Based Classification for Spam Detection. SAUJS. April 2024;28(2):270-282. doi:10.16984/saufenbilder.1264476
Chicago	Keskin, Serkan, and Onur Sevli. “Machine Learning Based Classification for Spam Detection”. Sakarya University Journal of Science 28, no. 2 (April 2024): 270-82. https://doi.org/10.16984/saufenbilder.1264476.
EndNote	Keskin S, Sevli O (April 1, 2024) Machine Learning Based Classification for Spam Detection. Sakarya University Journal of Science 28 2 270–282.
IEEE	S. Keskin and O. Sevli, “Machine Learning Based Classification for Spam Detection”, SAUJS, vol. 28, no. 2, pp. 270–282, 2024, doi: 10.16984/saufenbilder.1264476.
ISNAD	Keskin, Serkan - Sevli, Onur. “Machine Learning Based Classification for Spam Detection”. Sakarya University Journal of Science 28/2 (April 2024), 270-282. https://doi.org/10.16984/saufenbilder.1264476.
JAMA	Keskin S, Sevli O. Machine Learning Based Classification for Spam Detection. SAUJS. 2024;28:270–282.
MLA	Keskin, Serkan and Onur Sevli. “Machine Learning Based Classification for Spam Detection”. Sakarya University Journal of Science, vol. 28, no. 2, 2024, pp. 270-82, doi:10.16984/saufenbilder.1264476.
Vancouver	Keskin S, Sevli O. Machine Learning Based Classification for Spam Detection. SAUJS. 2024;28(2):270-82.