Phishing websites are among the most pressing challenges in contemporary cybersecurity. Phishing is a form of social engineering attack that tricks users into divulging login credentials, bank account information, and personal identity details by masquerading as a genuine online platform. As attackers grow more sophisticated and the number of internet users worldwide continues to rise, rule-based filters and static blacklists have proven inadequate. This paper presents a supervised machine learning approach to phishing website detection based on automatically extracted features, including URL lexical features, graph-based features, and domain-based features. A curated dataset of phishing and legitimate URLs is assembled, preprocessed, and used to train a range of classification algorithms, including Random Forest, Support Vector Machine (SVM), and Logistic Regression. The results indicate that ensemble learning models, specifically Random Forest, are substantially more accurate than rule-based approaches. The paper also discusses the practical difficulties posed by adversarial attacks, class imbalance, and the design of feature spaces that generalize, and it outlines a roadmap for future research on adaptive anti-phishing systems.
Introduction
Phishing attacks are a major cybersecurity threat: they exploit human trust by presenting fake but visually convincing websites in order to steal sensitive information. Traditional defenses such as blacklists are reactive and cannot block new (“zero-hour”) phishing sites that have not yet been reported, while heuristic rules often generate false positives. This motivates the use of machine learning models, which learn patterns from labeled data and can generalize to previously unseen attacks.
The study builds a dataset of 3,000 URLs, balanced between phishing and legitimate sites, and extracts 15 engineered features from URL structure and domain information (such as URL length, presence of an IP address in place of a domain name, domain age, number of subdomains, and WHOIS registration data). These features are used to train several classifiers: Logistic Regression, SVM, Decision Tree, and Random Forest.
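As an illustration of the URL-structure features described above, the following minimal sketch computes a handful of them using only the Python standard library. It is not the study's actual feature extractor: the function names, the feature names, and the choice of the "@" symbol and HTTPS checks are assumptions added for illustration, and domain age and WHOIS features are omitted because they require a network lookup.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_lexical_features(url: str) -> dict:
    """Compute a few URL-structure features.

    Illustrative subset only (hypothetical names), not the 15 features
    used in the study.
    """
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    is_ip = host_is_ip(host)

    # Naive subdomain count: every label beyond the last two.
    # This miscounts multi-part suffixes such as "co.uk"; a public
    # suffix list would be needed for an accurate count.
    labels = host.split(".") if host and not is_ip else []

    return {
        "url_length": len(url),
        "has_ip_host": int(is_ip),
        "num_subdomains": max(len(labels) - 2, 0),
        # The two features below are assumptions, not named in the text.
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
    }


if __name__ == "__main__":
    for example in (
        "http://192.168.0.1/login",
        "https://secure.login.example.com/account",
    ):
        print(example, extract_lexical_features(example))
```

Each URL becomes a fixed-length numeric record, which is the form the classifiers listed above expect as input.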
The results show:
Logistic Regression: ~91.8% accuracy, but limited by its linear nature
SVM: ~94.2% accuracy, better performance using kernel methods but slower training
Random Forest: ~98.5% accuracy, the best performance due to ensemble learning and robustness
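Accuracy alone does not show how a classifier's errors divide between legitimate sites flagged as phishing (false positives) and phishing sites that slip through (false negatives). The sketch below derives the usual companion metrics from confusion-matrix counts. The counts are hypothetical: they assume a 600-URL held-out split with 300 URLs per class and were chosen only so that the accuracy matches the reported 98.5%; the study's actual test split and error breakdown are not given here.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive standard metrics from confusion-matrix counts.

    Phishing is treated as the positive class.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of legitimate URLs wrongly flagged as phishing.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts: 300 phishing and 300 legitimate test URLs,
    # with 5 missed phishing sites and 4 false alarms (9 errors in 600).
    metrics = classification_metrics(tp=295, fp=4, tn=296, fn=5)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

On a balanced test set, accuracy is a fair headline figure, but reporting precision, recall, and the false positive rate alongside it makes comparisons between classifiers more informative.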
Conclusion
This study set out to evaluate whether supervised machine learning could provide a reliable, scalable solution to a cybersecurity problem that rule-based remedies have not solved. The experimental evidence supports this: machine learning classifiers, and ensemble methods in particular, distinguished phishing URLs from legitimate ones with high accuracy across a multi-dimensional feature space. Among the algorithms benchmarked, Random Forest performed best, achieving 98.5% classification accuracy, a figure that compares favourably with published results in the phishing detection literature.
Beyond raw accuracy, the study contributes a structured analysis of the feature categories most informative for detection: URL structural anomalies, domain registration metadata, and third-party reputation signals each provide complementary evidence. The findings indicate that no single feature or feature category is sufficient in isolation; reliable performance across the range of phishing strategies represented in the dataset comes from integrating diverse signals through an expressive model.