Research Article

Machine Learning-Based Effective Malicious Web Page Detection

Year 2022, Volume: 11 Issue: 4, 28 - 39, 31.12.2022

Abstract

Internet use grows more widespread every day, exposing millions of users to the risk of cyberattacks.
During the COVID-19 pandemic in particular, internet usage increased significantly, and a variety of cyberattacks were
carried out through malicious websites. Such attacks can capture sensitive data, including people’s private, banking,
and social information. Many methods have been developed to prevent cyberattacks; in particular, methods
based on machine learning give more successful results than traditional methods. This study aims
to detect malicious websites automatically by using the URL properties of those websites. For this purpose, popular machine
learning methods were used: decision tree (DT), k-nearest neighbors (kNN), LightGBM, logistic regression (LR),
multilayer perceptron (MLP), random forest (RF), support vector machine (SVM), and XGBoost. According to the experimental results,
the RF algorithm achieved 96% accuracy.
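
The abstract does not list which URL properties were used, so the following is only a minimal sketch of the general idea: turning a URL string into numeric lexical features that a classifier such as RF could consume. The function names `url_features` and `is_ipv4`, and the particular features chosen, are assumptions for illustration and are not taken from the study.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ipv4(host: str) -> bool:
    """Return True if the host is a literal IPv4 address."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Extract a few simple lexical features from a URL.

    Hypothetical feature set chosen for illustration only.
    """
    # Add a scheme when missing so that urlsplit can locate the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parts.scheme == "https"),
        "host_is_ipv4": int(is_ipv4(host)),
    }


if __name__ == "__main__":
    for example in ("http://192.168.0.1/login-update?id=42", "https://example.com/"):
        print(example, url_features(example))
```

Each dictionary can be converted to a row of a feature matrix and passed, together with benign/malicious labels, to any of the classifiers named in the abstract.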

References

  • U. Can and B. Alatas, “Cyberbullying and cyberstalking on online social networks,” in Securing Social Networks in Cyberspace. CRC Press, 2021, pp. 141–162.
  • R. S. Arslan, “Kötücül URL filtreleme için derin öğrenme modeli tasarımı,” Avrupa Bilim ve Teknoloji Dergisi, no. 29, pp. 122–128, 2021.
  • S. He, B. Li, H. Peng, J. Xin, and E. Zhang, “An effective cost-sensitive XGBoost method for malicious URLs detection in imbalanced dataset,” IEEE Access, vol. 9, pp. 93089–93096, 2021.
  • A. Sirageldin, B. B. Baharudin, and L. T. Jung, “Malicious web page detection: A machine learning approach,” in Advances in Computer Science and its Applications. Springer, 2014, pp. 217–224.
  • Y.-T. Hou, Y. Chang, T. Chen, C.-S. Laih, and C.-M. Chen, “Malicious web content detection by machine learning,” Expert Systems with Applications, vol. 37, no. 1, pp. 55–60, 2010.
  • J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Learning to detect malicious URLs,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 1–24, 2011.
  • W. Zhang, Y.-X. Ding, Y. Tang, and B. Zhao, “Malicious web page detection based on on-line learning algorithm,” in 2011 International Conference on Machine Learning and Cybernetics, vol. 4. IEEE, 2011, pp. 1914–1919.
  • B. Eshete, “Effective analysis, characterization, and detection of malicious web pages,” in Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 355–360.
  • H. B. Kazemian and S. Ahmed, “Comparisons of machine learning techniques for detecting malicious webpages,” Expert Systems with Applications, vol. 42, no. 3, pp. 1166–1177, 2015.
  • O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from URLs,” Expert Systems with Applications, vol. 117, pp. 345–357, 2019.
  • D. Liu and J.-H. Lee, “CNN based malicious website detection by invalidating multiple web spams,” IEEE Access, vol. 8, pp. 97258–97266, 2020.
  • J. Li, Z. Zhang, and C. Guo, “Machine learning-based malicious X.509 certificates’ detection,” Applied Sciences, vol. 11, no. 5, p. 2164, 2021.
  • A. S. Raja, R. Vinodini, and A. Kavitha, “Lexical features based malicious URL detection using machine learning techniques,” Materials Today: Proceedings, vol. 47, pp. 163–166, 2021.
  • SPSS, AnswerTree Algorithm Summary. USA: SPSS White Paper, 1999.
  • J. Sun and H. Li, “Data mining method for listed companies’ financial distress prediction,” Knowledge-Based Systems, vol. 21, no. 1, pp. 1–5, 2008.
  • T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
  • M. Khan, Q. Ding, and W. Perrizo, “k-nearest neighbor classification on spatial data streams using p-trees,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2002, pp. 517–528.
  • E. Erdem and F. Bozkurt, “A comparison of various supervised machine learning techniques for prostate cancer prediction,” Avrupa Bilim ve Teknoloji Dergisi, no. 21, pp. 610–620, 2021.
  • C. Mood, “Logistic regression: Why we cannot do what we think we can do, and what we can do about it,” European Sociological Review, vol. 26, no. 1, pp. 67–82, 2010.
  • S. Domínguez-Almendros, N. Benítez-Parejo, and A. R. Gonzalez-Ramirez, “Logistic regression models,” Allergologia et Immunopathologia, vol. 39, no. 5, pp. 295–305, 2011.
  • H. Ramchoun, Y. Ghanou, M. Ettaouil, and M. A. Janati Idrissi, “Multilayer perceptron: Architecture optimization and training,” International Journal of Interactive Multimedia and Artificial Intelligence, 2016.
  • H. Faris, I. Aljarah, N. Al-Madi, and S. Mirjalili, “Optimizing the learning process of feedforward neural networks using lightning search algorithm,” International Journal on Artificial Intelligence Tools, vol. 25, no. 06, p. 1650033, 2016.
  • L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  • M. Belgiu and L. Drăguţ, “Random forest in remote sensing: A review of applications and future directions,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 114, pp. 24–31, 2016.
  • M. Mursalin, Y. Zhang, Y. Chen, and N. V. Chawla, “Automated epileptic seizure detection using improved correlation-based feature selection with random forest classifier,” Neurocomputing, vol. 241, pp. 204–214, 2017.
  • H. Chen, Z. Lin, H. Wu, L. Wang, T. Wu, and C. Tan, “Diagnosis of colorectal cancer by near-infrared optical fiber spectroscopy and random forest,” Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, vol. 135, pp. 185–191, 2015.
  • C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “LightGBM: A highly efficient gradient boosting decision tree,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
  • O. Sagi and L. Rokach, “Ensemble learning: A survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 4, p. e1249, 2018.

There are 30 citations in total.

Details

Primary Language English
Subjects Computer Software
Journal Section Research Article
Authors

Anıl Utku 0000-0002-7240-8713

Ümit Can 0000-0002-8832-6317

Publication Date December 31, 2022
Submission Date July 25, 2022
Published in Issue Year 2022 Volume: 11 Issue: 4

Cite

IEEE A. Utku and Ü. Can, “Machine Learning-Based Effective Malicious Web Page Detection”, IJISS, vol. 11, no. 4, pp. 28–39, 2022.