Phishing websites are among the most serious cybersecurity threats on the Internet. These fraudulent sites imitate legitimate websites to trick users into sharing passwords, bank details, and other sensitive data. Because cybercriminals continually create new phishing websites, identifying them through traditional means is becoming progressively harder, which makes machine learning a promising approach to phishing detection. In this paper, we propose a machine learning model that detects phishing sites from attributes of websites and their URLs, such as URL length, domain age, use of HTTPS, redirections, and the presence of suspicious characters in the URL. We evaluate several popular machine learning techniques, namely Random Forest, SVM, Decision Tree, Logistic Regression, and ANN, for their effectiveness in detecting phishing. The experiments apply these algorithms to public datasets from PhishTank and the UCI Machine Learning Repository. The results indicate that machine learning algorithms can distinguish phishing websites from legitimate ones with high accuracy. The system has the potential to strengthen cybersecurity strategies and to detect phishing websites in real time.
Introduction
This paper presents a machine learning-based phishing website detection system designed to improve cybersecurity by identifying malicious websites more effectively than traditional methods.
Phishing is a major cyber threat in which attackers create fake websites to steal sensitive user data such as passwords and banking credentials. Traditional detection methods such as blacklists and rule-based systems are limited because they cannot reliably detect new or previously unseen phishing sites.
To address this, we propose a system that uses machine learning to classify websites as legitimate, suspicious, or phishing based on features extracted from URLs and website behavior. These features include URL length, use of HTTPS, redirects, domain age, number of subdomains, suspicious characters, DNS records, and traffic-related indicators.
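The lexical part of this feature set can be sketched with the Python standard library alone. The function name `extract_url_features`, the feature names, and the exact rules below are illustrative assumptions rather than the system's actual extractor. Features that need external lookups (domain age, DNS records, traffic indicators) are left out.

```python
import ipaddress
from urllib.parse import urlsplit


def extract_url_features(url):
    """Return a dict of simple lexical features for one URL (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # A raw IP address in place of a domain name is a common phishing signal.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    # Rough subdomain count: every label beyond "name.tld".
    # Assumption: multi-part suffixes such as "co.uk" are not handled.
    labels = [label for label in host.split(".") if label]
    subdomain_count = 0 if host_is_ip else max(len(labels) - 2, 0)

    return {
        "url_length": len(url),
        "uses_https": int(parts.scheme == "https"),
        "subdomain_count": subdomain_count,
        "has_at_symbol": int("@" in url),
        "hyphens_in_host": host.count("-"),
        "host_is_ip": int(host_is_ip),
        # "//" inside the path can indicate an embedded redirect.
        "double_slash_in_path": int("//" in parts.path),
    }


print(extract_url_features("https://www.example.com/login"))
```

Each feature is returned as an integer so that the dictionary can be converted directly into a numeric vector for a classifier.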
The system is built with Python, Flask, scikit-learn, Pandas, and SQLite. It uses a dataset of over 11,000 website entries, which is preprocessed by cleaning, feature selection, and an 80:20 split into training and testing sets.
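The cleaning and 80:20 split can be illustrated in plain Python. This is a minimal sketch on toy records, not the project's pipeline: the helper names `clean` and `stratified_split` are hypothetical, stratifying by class is an assumption that the text does not confirm, and feature selection is omitted. In scikit-learn, `train_test_split` with `test_size=0.2` serves the same purpose.

```python
import random
from collections import defaultdict


def clean(records):
    """Drop records with missing values and exact duplicates, keeping order."""
    seen, kept = set(), []
    for features, label in records:
        key = (tuple(features), label)
        if None in features or label is None or key in seen:
            continue
        seen.add(key)
        kept.append((features, label))
    return kept


def stratified_split(records, test_ratio=0.2, seed=42):
    """Split records into train and test sets with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[record[1]].append(record)

    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy placeholder data: 60 legitimate (0) and 40 phishing (1) records,
# plus one duplicate and one record with a missing value.
records = [([i, i % 2], 0) for i in range(60)] + [([i, i % 2], 1) for i in range(40)]
records.append(([0, 0], 0))
records.append(([None, 1], 1))

cleaned = clean(records)
train, test = stratified_split(cleaned)
print(len(cleaned), len(train), len(test))  # 100 80 20
```

Fixing the random seed keeps the split reproducible, so different models are compared on the same test set.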
We test several machine learning models: Random Forest, SVM, Decision Tree, Logistic Regression, and Artificial Neural Networks. Random Forest proved the most effective owing to its high accuracy and robustness.
The system architecture includes:
A web-based frontend for user interaction (HTML, CSS, Jinja2)
A Flask backend for processing URLs and predictions
A feature extraction module for analyzing website characteristics
A machine learning engine for classification
A SQLite database for storing scan history and blacklisted URLs
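The storage component in the list above can be sketched with Python's built-in `sqlite3` module. The table names, column names, and helper functions here are assumptions chosen for illustration. The source states only that scan history and blacklisted URLs are stored in SQLite.

```python
import sqlite3

# Hypothetical schema: one table per kind of stored data.
SCHEMA = """
CREATE TABLE IF NOT EXISTS blacklist (
    url      TEXT PRIMARY KEY,
    added_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS scan_history (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    url        TEXT NOT NULL,
    prediction TEXT NOT NULL,
    confidence REAL NOT NULL,
    scanned_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_database(path=":memory:"):
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def is_blacklisted(conn, url):
    """Return True if the exact URL is present in the blacklist table."""
    row = conn.execute("SELECT 1 FROM blacklist WHERE url = ?", (url,)).fetchone()
    return row is not None


def record_scan(conn, url, prediction, confidence):
    """Append one scan result to the history table."""
    conn.execute(
        "INSERT INTO scan_history (url, prediction, confidence) VALUES (?, ?, ?)",
        (url, prediction, confidence),
    )
    conn.commit()


conn = open_database()
conn.execute("INSERT INTO blacklist (url) VALUES (?)", ("http://bad.example/login",))
record_scan(conn, "http://bad.example/login", "phishing", 1.0)
print(is_blacklisted(conn, "http://bad.example/login"))  # True
```

Parameterized queries (the `?` placeholders) keep user-submitted URLs from being interpreted as SQL.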
The workflow is as follows. A user submits a URL, and the system first checks whether it appears in the blacklist. If the URL is not listed, the system extracts its features, converts them into a feature vector, and uses the trained model to predict its class together with a confidence score. Results are displayed in real time.
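This workflow can be sketched as a single function. Everything below is a hedged illustration: `scan_url`, `toy_features`, and `ToyModel` are hypothetical stand-ins, the two-feature vector is a placeholder, and reporting a blacklisted URL as phishing with confidence 1.0 is an assumption. The stand-in model only mimics the `predict_proba` interface that scikit-learn classifiers expose.

```python
FEATURE_ORDER = ["url_length", "uses_https"]  # fixed column order for the vector
CLASS_NAMES = ["legitimate", "suspicious", "phishing"]


def toy_features(url):
    """Placeholder extractor; a real one would compute many more features."""
    return {"url_length": len(url), "uses_https": int(url.startswith("https://"))}


class ToyModel:
    """Stand-in for a trained classifier; returns hard-coded probabilities."""

    def predict_proba(self, vectors):
        results = []
        for length, uses_https in vectors:
            if uses_https and length < 40:
                results.append([0.8, 0.15, 0.05])
            else:
                results.append([0.1, 0.2, 0.7])
        return results


def scan_url(url, blacklist, extract_features, model):
    # Step 1: a known-bad URL is reported without running the model.
    if url in blacklist:
        return {"label": "phishing", "confidence": 1.0, "source": "blacklist"}

    # Step 2: extract features and order them into a vector.
    features = extract_features(url)
    vector = [features[name] for name in FEATURE_ORDER]

    # Step 3: take the most probable class and use its probability as confidence.
    probabilities = model.predict_proba([vector])[0]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "label": CLASS_NAMES[best],
        "confidence": float(probabilities[best]),
        "source": "model",
    }


blacklist = {"http://bad.example/login"}
print(scan_url("https://example.com/", blacklist, toy_features, ToyModel()))
```

Keeping the feature order in one constant matters because the model must see columns at prediction time in the same order as during training.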
The models are evaluated using standard metrics: accuracy, precision, recall, F1-score, and the confusion matrix.
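As a worked example of how these metrics relate, the sketch below derives them from confusion-matrix counts. It simplifies to two classes, with phishing as the positive class, whereas the system distinguishes three; in that case the same quantities would be computed per class. The labels are toy values chosen for the arithmetic, not experimental results, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics["confusion"])  # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 5}
print(metrics["accuracy"], metrics["precision"], metrics["recall"], metrics["f1"])
# 0.8 0.75 0.75 0.75
```

For phishing detection, recall is worth watching alongside accuracy, because a missed phishing site (a false negative) is usually more costly than a false alarm.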
Conclusion
The proposed system is an effective machine learning-based solution for detecting phishing websites. It analyzes characteristics of the URL and the website, such as URL length, use of HTTPS, redirects, domain age, and suspicious characters, to categorize a website as legitimate, suspicious, or phishing. The Random Forest classifier performed well in the prediction phase of the project. In addition, the web application supports real-time predictions, scan history records, blacklist management, and admin monitoring.