A Novel Machine Learning Approach for Malicious URL Classification

Authors: P. Sukumar Reddy, K. Balakrishna Maruthiram

DOI Link: https://doi.org/10.22214/ijraset.2025.71858

Abstract

The exponential rise in cyber threats has made malicious URL detection an essential component of modern cybersecurity systems. Malicious URLs are frequently used to initiate phishing attacks, spread malware, or deface legitimate websites, posing severe risks to users and organizations. In this research, we propose a hybrid machine learning approach for the efficient classification of URLs into four categories: benign, phishing, malware, and defacement. The system leverages both traditional machine learning algorithms and advanced deep learning techniques. Lexical features such as URL length, the number of special characters, and HTTPS presence are extracted and used to train classifiers including Decision Tree, Random Forest, and XGBoost. In parallel, a Character-level Convolutional Neural Network (Char-CNN) is developed to process raw URL strings, enabling automatic feature extraction at the character level. A labeled dataset containing over 599,359 real-world URLs is used for training and evaluation. The models are assessed using standard performance metrics such as accuracy, precision, recall, and F1-score. Experimental results show that the Char-CNN model achieves superior accuracy compared to traditional models, while Random Forest and XGBoost demonstrate robust and interpretable performance. The combination of feature-based and character-level models provides both high detection accuracy and adaptability to evolving attack patterns. The proposed system can be integrated into real-time web security applications to detect and block malicious URLs with high reliability.

Introduction

In today's digital world, malicious URLs are a major cyber threat used for phishing, malware distribution, and website defacement. Traditional detection methods like blacklisting and rule-based systems fail to catch novel or obfuscated threats. This project proposes a hybrid system that combines Machine Learning (ML) and Deep Learning (DL) to accurately detect and classify malicious URLs.

Key Components of the Proposed System

1. ML Models with Lexical Features

Algorithms: Decision Tree (DT), Random Forest (RF), XGBoost
Input: Manually engineered features from the URL string such as:
- Length of URL
- Number of special characters (e.g., @, -, ?)
- Use of HTTPS
- Presence of IP address
Benefits: Fast, interpretable, and suitable for real-time detection.
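
The four lexical features listed above can be computed directly from the URL string. The following is a minimal sketch, not the authors' extraction code: the function name, the exact set of "special" characters (the text only gives @, - and ? as examples), and the IPv4-only host check are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Characters counted as "special". The source names @, - and ? as examples;
# the remaining characters in this set are an assumption.
SPECIAL_CHARS = set("@-?=&%")

# Detects an explicit scheme such as "http://" at the start of the URL.
HAS_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://")

# Matches a dotted-quad IPv4 host such as 192.168.0.1 (IPv6 is ignored here).
IPV4_HOST = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")


def extract_lexical_features(url: str) -> dict:
    """Return the four lexical features named in the text for one URL."""
    # urlsplit only recognises the host when a scheme or a leading '//' is
    # present, so bare URLs like 'example.com/path' get '//' prepended.
    parts = urlsplit(url if HAS_SCHEME.match(url) else "//" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "num_special_chars": sum(ch in SPECIAL_CHARS for ch in url),
        "uses_https": int(parts.scheme.lower() == "https"),
        "has_ip_address": int(bool(IPV4_HOST.match(host))),
    }


features = extract_lexical_features("http://192.168.0.1/login?user=a@b")
```

Each URL becomes one fixed-length numeric row, which is the form tree-based classifiers such as Decision Tree, Random Forest, and XGBoost expect as input.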

2. Character-Level CNN (Char-CNN)

Deep learning model that analyzes raw URL strings at the character level.
Learns URL patterns automatically using:
- Embedding layer
- Convolutional + pooling layers
- Dense output layer
Advantage: Effective at detecting obfuscated or unseen URLs, no manual feature extraction needed.
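
The layer stack listed above can be traced with a small forward pass. This is a sketch of the data flow only: the weights are random and untrained, so the output probabilities carry no meaning. The alphabet, the layer sizes, the ReLU activation, and the use of global max pooling are assumptions, and bias terms are omitted for brevity. A real model would be built and trained in a deep learning framework rather than in plain Python.

```python
import math
import random

CLASSES = ["benign", "phishing", "malware", "defacement"]
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789:/.?=&-_@%"  # assumed alphabet
EMBED_DIM, NUM_FILTERS, KERNEL = 4, 3, 3                  # assumed toy sizes

rng = random.Random(0)  # fixed seed; weights are random, not trained


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(len(VOCAB) + 1, EMBED_DIM)  # row 0 = unknown or padding
filters = [rand_matrix(KERNEL, EMBED_DIM) for _ in range(NUM_FILTERS)]
dense = rand_matrix(len(CLASSES), NUM_FILTERS)


def char_cnn_forward(url):
    """Embedding -> 1D convolution + ReLU -> global max pooling -> softmax."""
    # str.find returns -1 for characters outside VOCAB, which maps to id 0.
    ids = [VOCAB.find(ch) + 1 for ch in url.lower()]
    ids += [0] * max(0, KERNEL - len(ids))       # pad very short inputs
    x = [embedding[i] for i in ids]              # shape: (len, EMBED_DIM)
    pooled = []
    for f in filters:
        # Slide the filter over every window of KERNEL consecutive characters.
        acts = [
            max(0.0, sum(f[k][d] * x[t + k][d]
                         for k in range(KERNEL) for d in range(EMBED_DIM)))
            for t in range(len(x) - KERNEL + 1)
        ]
        pooled.append(max(acts))                 # global max pooling
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in dense]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return dict(zip(CLASSES, (e / total for e in exps)))


probabilities = char_cnn_forward("http://example.com/login")
```

Because each convolution filter looks at short character windows wherever they occur, the pooled features do not depend on where in the URL a pattern appears, which is one reason character-level models cope with obfuscated strings.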

Dataset & Preprocessing

Source: Public dataset from Kaggle (~599,000 labeled URLs)
Classes: Benign, phishing, malware, defacement
Training subset: 250,000 entries (with class imbalance)
Preprocessing:
- Feature normalization for ML models
- Tokenization, padding, and embeddings for DL model
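
The two preprocessing paths above can be illustrated with a few helper functions. This is a sketch under stated assumptions: the source does not specify the normalization method, so min-max scaling is used as one common choice, and the helper names, the reserved ids (0 for padding, 1 for unknown characters), and right-padding are hypothetical. The embedding step itself happens inside the DL model and is not shown.

```python
def min_max_scale(column):
    """Scale one numeric feature column to [0, 1] (assumed normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: avoid dividing by zero
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def build_vocab(urls):
    """Map each distinct character to an id; 0 = padding, 1 = unknown."""
    chars = sorted({ch for url in urls for ch in url})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(url, vocab, max_len):
    """Tokenize a URL into character ids and force it to a fixed length."""
    ids = [vocab.get(ch, 1) for ch in url][:max_len]  # truncate long URLs
    return ids + [0] * (max_len - len(ids))           # right-pad short URLs


vocab = build_vocab(["ab", "bc"])
padded = encode("abz", vocab, 5)       # 'z' was never seen, so it maps to 1
scaled = min_max_scale([10, 20, 30])
```

Fixed-length integer sequences are what an embedding layer consumes, while the scaled columns feed the feature-based models.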

Model Evaluation

Metrics: Accuracy, Precision, Recall, F1-score, and Confusion Matrix
Results:
- Char-CNN achieved >99% accuracy, excelling at detecting complex threats.
- XGBoost and Random Forest also showed strong performance and interpretability.
Grid search was used to tune the ML models; the Char-CNN was trained over multiple epochs.
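
The metrics listed above all derive from the confusion matrix. The sketch below shows one way to compute them for the four classes. The function name is hypothetical, and the labels in the usage lines are toy data made up for illustration, not results from the paper.

```python
from collections import Counter

CLASSES = ["benign", "phishing", "malware", "defacement"]


def evaluate(y_true, y_pred, classes=CLASSES):
    """Return accuracy, the confusion matrix, and per-class precision/recall/F1."""
    pairs = Counter(zip(y_true, y_pred))
    # confusion[i][j] = number of samples of true class i predicted as class j
    confusion = [[pairs[(t, p)] for p in classes] for t in classes]
    accuracy = sum(confusion[i][i] for i in range(len(classes))) / len(y_true)
    per_class = {}
    for i, name in enumerate(classes):
        tp = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum: TP + FP
        actual = sum(confusion[i])                    # row sum: TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[name] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, confusion, per_class


# Toy labels for illustration only.
y_true = ["benign", "benign", "phishing", "malware", "defacement", "phishing"]
y_pred = ["benign", "phishing", "phishing", "malware", "defacement", "benign"]
accuracy, confusion, per_class = evaluate(y_true, y_pred)
```

Since the training subset is class-imbalanced, the per-class precision, recall, and F1 values are more informative than overall accuracy alone.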

System Architecture

Modular and parallel architecture with four core stages:
1. Input: Raw URLs
2. Preprocessing: Normalization, tokenization, feature extraction
3. Model Processing: ML and DL models applied in parallel
4. Prediction Output: URL classified into one of four categories
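
The four stages above can be sketched as a small pipeline. Everything model-related here is a placeholder: `lexical_model` and `char_model` are hypothetical stand-ins that return hard-coded probabilities, not trained classifiers. The source does not state how the two model outputs are merged into one prediction, so averaging the class probabilities is an assumption made only to complete the example.

```python
from concurrent.futures import ThreadPoolExecutor

CLASSES = ["benign", "phishing", "malware", "defacement"]


def lexical_model(url):
    """Hypothetical stand-in for a trained feature-based classifier."""
    suspicious = "@" in url or not url.startswith("https://")
    return [0.4, 0.4, 0.1, 0.1] if suspicious else [0.85, 0.05, 0.05, 0.05]


def char_model(url):
    """Hypothetical stand-in for a trained Char-CNN."""
    return [0.2, 0.6, 0.1, 0.1] if "login" in url else [0.7, 0.1, 0.1, 0.1]


def classify(url):
    # Stages 1 and 2: take the raw URL and preprocess it (only trimmed here).
    url = url.strip()
    # Stage 3: run both models in parallel on the same URL.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(m, url) for m in (lexical_model, char_model)]
        outputs = [f.result() for f in futures]
    # Stage 4: merge by averaging class probabilities. This is an assumption;
    # the source does not say how the two outputs are combined.
    averaged = [sum(col) / len(outputs) for col in zip(*outputs)]
    return CLASSES[averaged.index(max(averaged))], averaged


label, scores = classify("http://example.com/login@x")
```

Keeping the two models behind the same call signature means either one can be replaced or retrained without touching the surrounding stages, which is the practical benefit of a modular layout.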

Key Contributions

Combines strengths of fast, interpretable ML models with powerful, generalizable DL models
Enables real-time URL prediction for practical cybersecurity applications
Supports integration with browsers, email filters, and firewalls

Conclusion

The rise in web-based services has brought a significant increase in cyber threats, particularly through malicious URLs used for phishing, malware, and defacement. Traditional detection methods like blacklists and rule-based systems struggle to keep up with evolving threats. To overcome these limitations, this project developed a hybrid malicious URL detection system combining machine learning (ML) and deep learning (DL) techniques.

The goal was to build a multi-class classifier to detect benign, phishing, malware, and defacement URLs with high accuracy. The system uses classical ML algorithms such as Decision Tree, Random Forest, and XGBoost, along with a Character-level Convolutional Neural Network (Char-CNN) that processes raw URLs directly.

Lexical features such as length, special character counts, HTTPS usage, and presence of IP addresses were extracted from the URLs. These features helped the ML models classify URLs effectively. Among the ML models, Random Forest and XGBoost performed best, with XGBoost offering the highest precision and scalability.

The Char-CNN model eliminated the need for manual feature extraction and showed exceptional accuracy (over 99%) in detecting obfuscated and novel URLs. The combination of ML and DL enabled the system to balance speed, accuracy, and adaptability.

Model performance was evaluated using accuracy, precision, recall, and F1-score. The ML models achieved over 91% accuracy, while the Char-CNN model surpassed 99%. The models were trained and tested on real-world data from Kaggle and exported using Joblib and TensorFlow, making them ready for real-time applications like browser plugins, email filters, or network firewalls.

Copyright

Copyright © 2025 P. Sukumar Reddy, K. Balakrishna Maruthiram. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET71858

Publish Date : 2025-05-30

ISSN : 2321-9653

Publisher Name : IJRASET
