Phishing websites are a pervasive cybersecurity threat in which adversaries build fraudulent web pages to steal sensitive user information such as login credentials, financial data, and personal details. Detecting them is difficult because attackers continuously change their strategies. This paper proposes a hybrid machine learning architecture for phishing website detection that integrates Bidirectional Encoder Representations from Transformers (BERT) for textual feature extraction, Graph Neural Networks (GNNs) for analyzing structural relationships among web entities, and LightGBM for ensemble classification. The system captures both semantic patterns from URLs and structural information from website data. Features including URL characteristics, domain information, webpage content patterns, and hyperlink structures are extracted and used for binary classification. The system is trained and evaluated on a labeled dataset of 50,000 website samples and shows improved performance over traditional detection methods on accuracy, precision, recall, and F1-score. The hybrid model reaches a ROC-AUC of 0.983, indicating strong discriminative capability. Because it learns from data rather than from fixed lists of known malicious sites, the architecture is designed to adapt to evolving phishing strategies.
Introduction
This paper addresses the growing threat of phishing websites, which are fraudulent web pages designed to imitate legitimate websites and steal sensitive user information such as passwords, financial details, and personal data. Traditional phishing detection methods, including blacklist-based and rule-based systems, are limited because they rely on previously known malicious websites and cannot effectively detect newly created or sophisticated phishing attacks. Therefore, there is a need for intelligent, adaptive, and data-driven detection approaches.
To overcome these limitations, the study proposes a hybrid machine learning framework with three components: BERT (Bidirectional Encoder Representations from Transformers) for analyzing the semantics of URL and webpage text, Graph Neural Networks (GNNs) for modeling relationships among domains, hyperlinks, and web entities, and LightGBM as an ensemble classifier that integrates the features from these sources. The system uses a wide range of features, including URL structure, domain registration information, webpage content, security attributes, and hyperlink network characteristics. Evaluated on a dataset of 50,000 labeled website samples, the proposed model achieved a ROC-AUC score of 0.983, outperforming the individual baseline models.
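As an illustration of how features from the three sources might be integrated, the sketch below uses simple concatenation (early fusion). This is an assumption: the text above does not state the fusion strategy, and the function names, vector sizes, and values are hypothetical toy choices. Real BERT and GNN embeddings would be far wider than the ones shown.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def fuse_features(text_emb: Sequence[float],
                  graph_emb: Sequence[float],
                  tabular: Sequence[float]) -> List[float]:
    """Early fusion: concatenate the three feature views into one flat row.

    text_emb  - stand-in for a BERT embedding of the URL or page text
    graph_emb - stand-in for a GNN embedding of the site's node
    tabular   - hand-crafted features (URL length, HTTPS flag, ...)
    """
    return ([float(x) for x in text_emb]
            + [float(x) for x in graph_emb]
            + [float(x) for x in tabular])


def build_design_matrix(samples: Sequence[Sample]) -> List[List[float]]:
    """Fuse every sample and check that all rows have the same width."""
    rows = [fuse_features(*sample) for sample in samples]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all samples must yield the same number of fused features")
    return rows


if __name__ == "__main__":
    # Toy values only: 4-dim "text" vector, 3-dim "graph" vector, 2 tabular features.
    toy_samples = [
        ([0.1, 0.2, 0.3, 0.4], [1.0, 0.0, 0.5], [53, 1]),
        ([0.9, 0.1, 0.0, 0.2], [0.2, 0.7, 0.1], [112, 0]),
    ]
    for row in build_design_matrix(toy_samples):
        print(row)
```

Under this assumption, each fused row would be one training example for the gradient-boosted classifier, so the classifier sees textual, structural, and tabular evidence side by side.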
The study aims to evaluate the effectiveness of BERT, GNN, and LightGBM for phishing detection, compare the hybrid model with traditional machine learning approaches, identify the most important phishing indicators, and assess performance using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
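For reference, the evaluation metrics named above can be computed from labels and model scores as in the following minimal sketch. It assumes label 1 means phishing and label 0 means legitimate; the function names and the toy data are hypothetical and are not results from this study. ROC-AUC is computed here with the pairwise-ranking definition, which is quadratic in the number of samples and therefore only suitable for small illustrations.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1, treating label 1 (phishing) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random phishing sample outscores a random legitimate one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and scores, invented for illustration only.
    labels = [1, 1, 1, 0, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
    predictions = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption
    print(classification_metrics(labels, predictions))
    print(roc_auc(labels, scores))
```

Accuracy, precision, recall, and F1 depend on the chosen decision threshold, whereas ROC-AUC summarizes ranking quality across all thresholds, which is why the two kinds of metric are reported together.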
The literature survey reviews existing phishing detection methods, including blacklist systems, heuristic techniques, machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbors (KNN), and deep learning approaches including Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and reinforcement learning models. While these methods have improved phishing detection, most fail to simultaneously exploit textual, structural, and tabular website information within a unified framework.
The proposed methodology consists of several stages:
Dataset Collection: A dataset of 50,000 websites is gathered from sources such as PhishTank, OpenPhish, Alexa, Tranco, and Kaggle, containing both phishing and legitimate websites.
Data Preprocessing: Missing values, duplicates, and outliers are removed, while URLs are normalized and tokenized for analysis.
Feature Extraction: Features include URL length, special characters, domain age, HTTPS usage, SSL validity, external links, website traffic rank, and webpage behavior patterns.
Data Splitting: The dataset is divided into training and testing sets in an 80:20 ratio, and five-fold cross-validation is used for model validation.
Feature Selection: Correlation analysis and Principal Component Analysis (PCA) are used to reduce dimensionality, and SHAP (SHapley Additive exPlanations) values are used to identify the most important features.
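The preprocessing and feature extraction stages above can be illustrated with a minimal sketch that normalizes a URL, tokenizes it, and derives a few lexical features using only the Python standard library. The function names, the set of special characters, and the exact feature list are assumptions made for illustration. Features such as domain age, SSL validity, and traffic rank need external lookups and are not covered here.

```python
import re
from typing import Dict, List
from urllib.parse import urlsplit

# Characters counted as "special"; this particular set is an assumption.
SPECIAL_CHARS = set("@-_?=&%.~#+")


def normalize_url(url: str) -> str:
    """Trim whitespace, add a default scheme, lowercase the host, drop the fragment."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def tokenize_url(url: str) -> List[str]:
    """Split a URL into lowercase alphanumeric tokens."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", url) if tok]


def lexical_features(url: str) -> Dict[str, int]:
    """Features that can be computed from the URL string alone."""
    norm = normalize_url(url)
    parts = urlsplit(norm)
    host = parts.hostname or ""
    return {
        "url_length": len(norm),
        "num_special_chars": sum(1 for ch in norm if ch in SPECIAL_CHARS),
        "num_digits": sum(1 for ch in norm if ch.isdigit()),
        "num_host_dots": host.count("."),
        "uses_https": int(parts.scheme == "https"),
        "has_at_symbol": int("@" in parts.netloc),
        # Raw IPv4 address used in place of a domain name.
        "host_is_ip": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
    }


if __name__ == "__main__":
    # example.com is a reserved documentation domain, used here as a placeholder.
    raw = "  HTTPS://Secure-Login.Example.COM/Account/verify?id=42#top "
    print(normalize_url(raw))
    print(tokenize_url(normalize_url(raw)))
    print(lexical_features(raw))
```

Normalizing before feature extraction keeps trivially different spellings of the same URL from producing different feature values, which also makes duplicate removal more reliable.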
The system architecture consists of five modules: Input URL Processing, Data Preprocessing, Feature Extraction, Hybrid Model Inference, and Prediction Output. User-provided URLs are processed and transformed into meaningful features, which are analyzed by the combined BERT-GNN-LightGBM model to produce a phishing or legitimate classification along with a confidence score.
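The flow through the five modules can be sketched as a chain of small functions. This is a structural illustration only: the function names are hypothetical, the feature set is reduced to three values, and placeholder_model is a hand-weighted logistic score that merely stands in for the trained BERT-GNN-LightGBM model so that the example runs on its own.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict
from urllib.parse import urlsplit


@dataclass
class Prediction:
    url: str
    label: str         # "phishing" or "legitimate"
    confidence: float  # probability assigned to the predicted label


def process_input(raw_url: str) -> str:
    """Module 1: accept a user-provided URL and ensure it has a scheme."""
    url = raw_url.strip()
    if not url:
        raise ValueError("empty URL")
    return url if "://" in url else "http://" + url


def preprocess(url: str) -> str:
    """Module 2: normalize the URL (lowercase host, drop the fragment)."""
    parts = urlsplit(url)
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.netloc.lower()}{parts.path or '/'}{query}"


def extract_features(url: str) -> Dict[str, float]:
    """Module 3: turn the URL into numeric features (reduced set for illustration)."""
    parts = urlsplit(url)
    return {
        "url_length": float(len(url)),
        "uses_https": float(parts.scheme == "https"),
        "num_hyphens": float(parts.netloc.count("-")),
    }


def placeholder_model(features: Dict[str, float]) -> float:
    """Module 4 stand-in: returns a phishing probability.

    The weights are invented; this is NOT the trained hybrid model.
    """
    z = (0.03 * features["url_length"]
         + 0.8 * features["num_hyphens"]
         - 1.5 * features["uses_https"]
         - 1.0)
    return 1.0 / (1.0 + math.exp(-z))


def classify(raw_url: str,
             model: Callable[[Dict[str, float]], float] = placeholder_model,
             threshold: float = 0.5) -> Prediction:
    """Module 5: run the pipeline and report a label with a confidence score."""
    url = preprocess(process_input(raw_url))
    p_phish = model(extract_features(url))
    if p_phish >= threshold:
        return Prediction(url, "phishing", p_phish)
    return Prediction(url, "legitimate", 1.0 - p_phish)


if __name__ == "__main__":
    print(classify("Example.com/login"))
```

Passing the model as a parameter keeps the surrounding modules independent of it, so a trained hybrid model could replace the placeholder without changing the input, preprocessing, or output stages.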
Conclusion
This paper presents a hybrid phishing website detection system integrating BERT, Graph Neural Networks, and LightGBM within a unified inference pipeline. The architecture addresses the key limitations of existing detection approaches by jointly modeling textual URL semantics, structural inter-domain relationships, and tabular website features. Evaluated on a dataset of 50,000 labeled samples, the proposed system demonstrates strong detection performance with a ROC-AUC of 0.983, confirming the effectiveness of multi-modal feature fusion for phishing classification.
The work highlights that handling class imbalance in the dataset and selecting discriminative features are prerequisites for building reliable detection systems. Combining deep learning feature extractors with efficient gradient-boosted classification provides a practical and scalable approach for real-world cybersecurity applications. Future work will explore real-time threat intelligence integration, extension to email and social media phishing vectors, and continuous model retraining to maintain effectiveness against zero-day phishing campaigns.