Abstract
Nowadays, phishing web sites have been increased. The purpose of these sites is to obtain benefits by acquiring personal information of people in general. Our identity and password information in our social media accounts and identity and address information on shopping sites are our personal information. If such information is received by unwanted people, it can have bad unpredictable consequences. In addition, the fact that we carry out a significant portion of our financial transactions such as our online banking transactions on the internet constitutes an important problem in terms of protection from such sites. For this purpose, antivirus software companies, browsers, search engines are working to protect users from such harmful sites in terms of providing better user service and satisfaction. In addition, the detection and prevention of fake web pages before the users is an important area of work in today's artificial intelligence studies. The easiest method of protecting these fraudulent sites in the internet environment where billions of people are browsing every day will be to detect and block fake web pages automatically. Machine learning classification algorithms are automatically identified as fake or real by the system by looking at the information of a page and this is one of the important advantages offered by artificial intelligence studies. With this study, using 10 properties determined for a website address; it is attempted to determine whether this address is a fake or a real address. The data used in this study were taken from Machine Learning Repository (UCI). Data analysis was performed based on the Cross Industry Standard Process Model (CRISP-DM). In the data set, it is labeled as the attribute that determines the status of websites (Class, Phishing = -1, Suspicious = 0 and Legitimate = 1). The study was also done by using RStudio analysis with R programming language. The classification algorithms used are Random Forest (RF), Support Vector Machines (SVM), J48, K-Nearest Neighbor (KNN) and Naive Bayes algorithms. The highest accuracy performance was obtained by Random Forest algorithm.