Phishing URL Detection Using XGBoost and Custom Feature Engineering

Authors: Prof. P. S. Prasad, Aishwarya Kalamkar, Manasi Nagpure, Neha Vaidya, Pranal Mohadikar, Bhagyashri Tembhurne

DOI Link: https://doi.org/10.22214/ijraset.2025.70277

Abstract

Phishing is a prevalent cyberattack technique that deceives users into revealing sensitive personal and financial information through fake websites. With the exponential growth of online services, phishing attacks have become more sophisticated, necessitating intelligent and automated detection mechanisms. This study introduces a smart phishing URL detection approach that utilizes carefully engineered features—such as lexical patterns, structural elements, and domain-related information—to differentiate between malicious and legitimate web addresses. A custom feature extraction module was developed to parse URLs and retrieve 13+ critical features, including URL length, directory structure, file name characteristics, presence of IP addresses, SSL certificate availability, information about the Autonomous System Number (ASN) and domain registration details, including creation and expiration dates. The extracted features were used to train an Extreme Gradient Boosting (XGBoost) classifier, selected for its superior performance in imbalanced and noisy datasets. The model was developed and fine-tuned using PyCaret, an automated machine learning library that optimizes classification performance using cross-validation and hyperparameter tuning. The trained model achieved strong performance across multiple evaluation metrics, highlighting its reliability and effectiveness in accurately identifying phishing URLs. To enhance usability, a web-based application was developed using FastAPI and HTML/CSS, allowing users to submit a URL and receive instant predictions regarding its legitimacy. The system provides an interpretable and scalable framework for real-time phishing detection, suitable for integration into email filters, browsers, and cybersecurity tools. The results affirm that combining feature engineering with a tuned XGBoost classifier offers an effective and deployable solution to mitigate phishing threats in real-world environments.

Introduction

As digital platforms become central to daily life—supporting transactions, communication, and learning—the risk of phishing attacks has grown. Phishing involves tricking users into revealing sensitive data by impersonating trusted entities via fake emails or websites. Traditional detection methods, like blacklists, struggle to identify new, zero-day phishing URLs, prompting the need for smarter, adaptive solutions.

Research Objective

This study proposes an intelligent phishing detection system using machine learning (ML) to analyze and classify URLs in real time. It aims to:

Identify phishing links by analyzing lexical, structural, and domain-specific features.
Use the XGBoost classifier, integrated with FastAPI, to deploy a web-based detection tool.

Literature Review

Prior studies have applied various ML techniques for phishing detection:

URL-based features (length, special characters, redirections)
Supervised learning models like Decision Trees, SVMs, and Random Forests
Advanced approaches including deep learning, reinforcement learning, and hybrid models
Use of domain-specific datasets and real-time detection systems

This research builds upon these methods by combining a refined feature set with an optimized XGBoost model deployed in a real-time web app.

Proposed System

The system integrates the following components:

Feature Extraction Module: Analyzes URL components (e.g., domain age, SSL use, redirection, IP presence).
XGBoost Classifier: Trained using PyCaret for efficient hyperparameter tuning and evaluation.
FastAPI Backend: Enables real-time predictions via a web interface.
Frontend Interface: A simple HTML/CSS-based UI for URL submission and instant feedback.

Methodology

A. Dataset

88,647 URLs: ~35% phishing, ~65% legitimate.
Sourced from open and verified phishing datasets.
Because the classes are imbalanced, performance was assessed with precision, recall, and AUC in addition to accuracy.
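
To illustrate why accuracy alone can mislead at this class ratio, the minimal sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical, chosen only to mirror a 35/65 split over 1,000 URLs; they are not results from the study, and the function name is illustrative.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute basic metrics from confusion-matrix counts (phishing = positive class)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical baseline that labels every URL as legitimate
# (350 phishing and 650 legitimate URLs).
always_legit = classification_metrics(tp=0, fp=0, fn=350, tn=650)

# Hypothetical classifier that catches most phishing URLs.
better = classification_metrics(tp=315, fp=20, fn=35, tn=630)

print(always_legit)
print(better)
```

The baseline reaches 65% accuracy while detecting no phishing URLs at all, which is why recall and precision on the phishing class are reported alongside accuracy.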

B. Data Cleaning & Feature Reduction

Reduced 112 raw features to 14 key features to improve efficiency and accuracy.
Removed low-variance and redundant features based on correlation and relevance.
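
The reduction step above can be sketched in plain Python as a variance filter followed by greedy correlation pruning. The thresholds (0.01 and 0.95), the function names, and the toy columns are assumptions made for illustration; the study does not state which cut-offs were used.

```python
from math import sqrt
from statistics import pvariance

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def prune_features(columns, min_variance=0.01, max_abs_corr=0.95):
    """Drop near-constant columns, then drop columns highly correlated with a kept one."""
    # Step 1: remove low-variance (near-constant) features.
    varied = [name for name, values in columns.items()
              if pvariance(values) > min_variance]
    # Step 2: keep a feature only if it is not redundant with one already kept.
    selected = []
    for name in varied:
        if all(abs(pearson(columns[name], columns[kept])) < max_abs_corr
               for kept in selected):
            selected.append(name)
    return selected

# Toy data (hypothetical values, not taken from the dataset).
toy = {
    "url_length": [20, 75, 33, 120, 48, 90],
    "domain_length": [12, 9, 15, 11, 8, 14],
    "url_length_copy": [21, 76, 34, 121, 49, 91],  # redundant with url_length
    "always_zero": [0, 0, 0, 0, 0, 0],             # carries no information
}
selected = prune_features(toy)
print(selected)
```

The greedy pass keeps the first feature of each correlated group, so the outcome depends on column order; ranking columns by importance first is a common refinement.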

C. Feature Engineering

Critical features extracted:

URL/domain/file length
Use of IP addresses
Presence of email in URL
SSL certificate and redirection behavior
Whois metadata (domain age, expiration)
ASN/IP details
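
A minimal sketch of how the purely lexical features in this list might be computed with the Python standard library follows. The function and feature names are illustrative, not the authors' extraction module. The `uses_https` flag only inspects the URL scheme and is a stand-in for the SSL certificate check, and the WHOIS, ASN, and redirection features are omitted because they require network lookups.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Simple pattern for an email address embedded anywhere in the URL.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def host_is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def extract_lexical_features(url: str) -> dict:
    """Derive length-based and flag features from the URL string alone."""
    # Add a scheme when missing so urlparse recognizes the host part.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    # Split the path into its directory part and the final file name.
    directory, _, file_name = parsed.path.rpartition("/")
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "directory_length": len(directory),
        "file_length": len(file_name),
        "domain_is_ip": int(host_is_ip(host)),
        "email_in_url": int(bool(EMAIL_RE.search(url))),
        "uses_https": int(parsed.scheme == "https"),  # scheme check only
    }

sample = "http://192.168.0.1/login/verify.php?user=a@b.com"
features = extract_lexical_features(sample)
print(features)
```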

D. Feature Selection

Used correlation heatmaps and feature importance analysis (via XGBoost) to remove redundant features.
Final selection ensured minimal multicollinearity and high predictive value.

E. Model Training via PyCaret

Automated preprocessing (imputation, scaling, encoding).
Used 10-fold Stratified Cross-Validation.
Evaluated with multiple metrics: Accuracy, Precision, Recall, F1-Score, AUC.
Dataset split: 80% training, 20% testing.
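
To make the idea of stratified cross-validation concrete, the sketch below assigns sample indices to 10 folds so that each fold keeps roughly the same phishing-to-legitimate ratio as the full set. It is a plain-Python illustration of the technique, not PyCaret's internal implementation; the function name, seed, and toy label counts are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=42):
    """Assign each sample index to one of k folds, preserving class ratios."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    # Shuffle within each class, then deal indices out round-robin.
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Toy labels mirroring the ~35% phishing (1) / ~65% legitimate (0) ratio.
labels = [1] * 350 + [0] * 650
folds = stratified_folds(labels, k=10)

for number, fold in enumerate(folds, start=1):
    phishing = sum(labels[i] for i in fold)
    print(f"fold {number}: {len(fold)} samples, {phishing} phishing")
```

In each round, one fold serves as the validation set and the remaining nine are used for training, so every sample is validated exactly once.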

System Deployment

Model saved as xgb.pkl and integrated with FastAPI.
Web interface allows real-time user input of URLs and outputs whether the link is phishing or legitimate.
Uses libraries like NumPy, Pandas, Scikit-learn, PyCaret, Matplotlib, and Socket/Whois APIs.
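
The request-handling logic behind such an endpoint can be sketched independently of the web framework: validate the input, extract features, query the model, and map the label to a verdict. Everything named here is hypothetical, including the `predict` interface, the encoding of 1 as phishing, and the stub model that stands in for the trained classifier loaded from xgb.pkl.

```python
def make_predict_handler(model, extract, feature_order):
    """Build a function that turns a submitted URL into a prediction payload."""
    def handle(url: str) -> dict:
        url = url.strip()
        if not url:
            return {"error": "empty URL"}
        features = extract(url)
        # Order the features exactly as the model saw them during training.
        row = [features[name] for name in feature_order]
        label = int(model.predict([row])[0])
        # Assumption: label 1 means phishing, 0 means legitimate.
        verdict = "phishing" if label == 1 else "legitimate"
        return {"url": url, "prediction": verdict}
    return handle

class LongUrlStub:
    """Stand-in for the trained model: flags URLs longer than 75 characters."""
    def predict(self, rows):
        return [1 if row[0] > 75 else 0 for row in rows]

handler = make_predict_handler(
    model=LongUrlStub(),
    extract=lambda url: {"url_length": len(url)},
    feature_order=["url_length"],
)

print(handler("http://example.com"))
print(handler("http://" + "a" * 100 + ".com"))
```

In a web application, a route function would call `handle` with the submitted URL and return the resulting dictionary as the JSON response.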

Conclusion

This study presents an effective and deployable solution for detecting phishing URLs using machine learning techniques, with a strong emphasis on feature engineering, model optimization, and real-time deployment. By extracting and analyzing a comprehensive set of lexical, structural, and domain-related features, the system was able to distinguish accurately between legitimate and malicious URLs. The XGBoost classifier outperformed the other methods in overall performance, with an accuracy of 97.27% and an AUC of 0.9957. These results support its robustness and suitability for security-critical applications. The selected features, particularly those related to domain behavior and URL structure, proved highly predictive, validating their importance in phishing detection. Furthermore, the integration of the trained model into a FastAPI-based web application illustrates the system's real-world applicability. The deployed solution enables users to receive phishing predictions instantly, making it useful for email filtering, web browser plugins, and cybersecurity gateways. In summary, this study demonstrates that with thoughtful feature selection, proper model tuning, and a suitable deployment architecture, it is possible to build a high-performance phishing detection system that is both accurate and scalable.

Copyright

Copyright © 2025 Prof. P. S. Prasad, Aishwarya Kalamkar, Manasi Nagpure, Neha Vaidya, Pranal Mohadikar, Bhagyashri Tembhurne. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET70277

Publish Date : 2025-05-03

ISSN : 2321-9653

Publisher Name : IJRASET
