In the digital era, phishing attacks pose a significant threat to online security, especially in areas such as e-banking, e-commerce, and information systems. This study focuses on enhancing the detection of phishing websites using advanced Machine Learning (ML) techniques. Phishing attacks typically involve deceiving users by mimicking legitimate websites to steal sensitive information such as usernames, passwords, and financial details. These attacks are often carried out through malicious URLs or cloned webpages that appear authentic to unsuspecting users. Accurately identifying such threats is critical, as phishing remains one of the leading causes of cybersecurity breaches. To address this, the proposed work utilizes supervised ML algorithms, including Random Forest and Decision Tree classifiers, based on extracted URL features such as lexical patterns, domain information, and host-based attributes. The system also integrates techniques for real-time URL analysis and domain verification. Experimental results demonstrate the effectiveness of the approach in accurately classifying phishing and legitimate websites, contributing to the development of intelligent cybersecurity solutions.
1. Introduction
In the digital age, phishing is a major cybersecurity threat where attackers impersonate trusted websites to steal sensitive data.
Traditional defenses (blacklists, rule-based filters) struggle against zero-day phishing threats.
This project uses Machine Learning (ML) techniques to detect phishing websites in real time from URL features, with the aim of improving online safety.
2. Literature Review
Abu-Nimeh et al. [9]: Random Forest outperforms other ML models (Naïve Bayes, SVM, Decision Trees).
Zhang et al. [1]: Lightweight real-time phishing detection using URL lexical features.
Sharma & Gupta [2]: Hybrid model combining BERT (text analysis) and ML (URL structure) for better accuracy.
Tanveer & Al-Turjman [3]: ML optimized for edge/mobile devices using MobileNet.
Patel & Roy [4]: Feature selection is key; Random Forest achieved the best precision/recall across datasets.
3. Existing Systems
Lexical features: URL length, special characters, HTTPS usage.
Host-based features: domain age (via WHOIS data), IP address, SSL certificate validity.
Model Training: SVM and Random Forest are trained and compared using evaluation metrics (accuracy, precision, recall, F1-score).
Prediction: Model predictions are summarized in a confusion matrix; Random Forest is favored for its accuracy and feature interpretability.
Algorithm: Random Forest with Gini impurity for splitting and majority voting for final classification.
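The items above (lexical features such as URL length, special characters, and HTTPS; Gini impurity for splitting; majority voting for the final class) can be illustrated with a minimal Python sketch. It is not the study's implementation. The function names, the set of characters treated as "special", and the extra host-dot count are assumptions made for illustration only, and a full Random Forest would additionally train each tree on a bootstrap sample with random feature subsets.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumption: which characters count as "special" is not specified in the text.
SPECIAL_CHARS = "@-_?=&%"


def lexical_features(url: str) -> dict:
    """Extract a few lexical features from a URL string (no network access)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_special_chars": sum(url.count(c) for c in SPECIAL_CHARS),
        "num_dots_in_host": host.count("."),  # hypothetical extra feature
        "uses_https": int(parsed.scheme == "https"),
    }


def gini_impurity(labels: list) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left: list, right: list) -> float:
    """Size-weighted Gini impurity of a binary split; lower is better."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


def majority_vote(tree_predictions: list) -> str:
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


if __name__ == "__main__":
    # Placeholder URL for demonstration only.
    print(lexical_features("https://example.com/login?user=a"))
    print(split_impurity(["phishing", "phishing"], ["legitimate", "legitimate"]))
    print(majority_vote(["phishing", "legitimate", "phishing"]))
```

A tree learner would evaluate `split_impurity` for each candidate threshold on each feature and keep the split with the lowest value; the forest then applies `majority_vote` to the per-tree outputs.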
6. Results
The model was trained on a labeled dataset of phishing and legitimate URLs to detect phishing attempts.
The system classifies input URLs as phishing or legitimate and alerts the user when a URL is flagged, supporting safer browsing.
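The evaluation metrics named for model training (accuracy, precision, recall, F1-score) all follow from the four counts of a confusion matrix. The sketch below shows that relationship, assuming "phishing" is treated as the positive class. The function names and the toy labels are hypothetical and are not results from this study.

```python
def confusion_counts(y_true: list, y_pred: list, positive: str = "phishing") -> dict:
    """Count true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels invented for illustration only.
    y_true = ["phishing", "phishing", "legitimate", "legitimate"]
    y_pred = ["phishing", "legitimate", "legitimate", "phishing"]
    print(classification_metrics(**confusion_counts(y_true, y_pred)))
```

For phishing detection, recall reflects how many phishing URLs are caught, while precision reflects how often an alert is justified; reporting both alongside accuracy guards against misleading figures on imbalanced datasets.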
Conclusion
This work presents an effective machine learning-based approach for detecting phishing websites by analyzing various features extracted from input URLs. The proposed system begins with a detailed feature extraction phase, focusing on lexical characteristics, host-based attributes, and popularity metrics. These features are then processed and used to train a Random Forest classifier on a labeled dataset comprising phishing and legitimate URLs. The classifier demonstrated high accuracy in identifying phishing attempts and was able to provide fast and reliable predictions suitable for real-time applications. The system’s ability to differentiate between malicious and legitimate links makes it a practical tool for integration into security platforms such as browser extensions or email filters. In future work, the model’s performance can be enhanced further by incorporating deep learning models and continuously updating the dataset with evolving phishing tactics to improve adaptability and generalization across diverse attack vectors.
References
[1] Zhang, Y., Liu, J., & Wang, X. (2024). Hybrid Deep Learning Framework for Real-Time Phishing Website Detection. Journal of Cybersecurity and Digital Trust, 6(1), 112–124.
[2] Sharma, R., & Gupta, A. (2023). An Efficient Phishing Detection Model Using BERT and URL Feature Engineering. In 2023 IEEE International Conference on Smart Systems and Machine Learning (ICSSML), pp. 34–39.
[3] Tanveer, M., & Al-Turjman, F. (2023). Lightweight Machine Learning Models for Phishing URL Detection on Edge Devices. Journal of Information Security and Applications, 76, 103767.
[4] Patel, S., & Roy, M. (2022). Comparative Study of URL-based Features for Phishing Website Classification Using Machine Learning. In 2022 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT).
[5] Zhao, H., & Zhang, L. (2022). Detection of Phishing Attacks with Ensemble Learning. In Proceedings of the 2022 International Symposium on Intelligent Computing and Security (ISICS), pp. 55–61.
[6] Alkawaz, M. H., Steven, S. J., & Hajamydeen, A. I. (2020). Detection of Phishing Websites using Machine Learning. In 16th IEEE International Colloquium on Signal Processing and its Applications (CSPA).
[7] Afroz, S., & Greenstadt, R. (2011). PhishZoo: Detecting Phishing Websites by Looking at Them. In Proceedings of the IEEE Fifth International Conference on Semantic Computing (ICSC), pp. 58–65.
[8] Astorino, A., Chiarello, A., Gaudioso, M., & Piccolo, A. (2019). Malicious URL Detection via Spherical Classification. Neural Computing and Applications, 31(12), 9317–9327.
[9] Abu-Nimeh, S., Nappa, D., Wang, X., & Nair, S. (2007). A Comparison of Machine Learning Techniques for Phishing Detection. In Proceedings of the Anti-Phishing Working Group’s 2nd Annual eCrime Researchers Summit (eCrime), pp. 60–69.