AI-Based Phishing Detection System

Authors: Shaik Dharvesh Abbas, Chintha Karthikeya, Pragada Dyva Sagar, Saka Sri Praveen

DOI Link: https://doi.org/10.22214/ijraset.2026.80606

Abstract

Phishing attacks remain one of the most critical cybersecurity threats, exploiting user trust through deceptive emails, malicious URLs, and fake web interfaces. Traditional detection approaches such as blacklist-based systems are ineffective against newly emerging and dynamically generated phishing attacks. This paper proposes an intelligent phishing detection framework that integrates machine learning and natural language processing techniques for improved accuracy and adaptability. The system analyzes a combination of URL-based, content-based, and domain-related features to identify malicious patterns. Multiple supervised learning models, including Random Forest, Support Vector Machine, and Logistic Regression, are evaluated using standard performance metrics. Experimental results demonstrate that the proposed hybrid approach achieves high detection accuracy while reducing false positives, making it suitable for real-time cybersecurity applications.

Introduction

The rapid growth of internet services and digital communication has increased the exchange of sensitive information online, making users more vulnerable to cyber threats. Among these threats, phishing attacks are one of the most common and harmful forms of cybercrime. In phishing attacks, attackers impersonate trusted organizations or individuals through fraudulent emails, websites, or URLs to steal confidential information such as passwords, banking details, and personal data.

Traditional phishing detection methods, such as blacklist-based filtering and rule-based systems, are effective only against known threats and struggle to detect newly created or obfuscated phishing attacks. To address these limitations, machine learning and Natural Language Processing (NLP) techniques have emerged as powerful solutions. These approaches can analyze large datasets, identify hidden patterns, and detect suspicious behavior even in previously unseen phishing attempts.

Proposed System

The paper proposes a hybrid phishing detection system that combines:

Machine Learning algorithms for classification.
URL-based feature analysis (e.g., URL length, special characters, HTTPS usage, subdomains).
Content-based analysis using NLP to identify deceptive language patterns in emails and web pages.

By integrating structural and textual features, the system improves phishing detection accuracy while reducing false positives. The framework is designed to be adaptive, scalable, and suitable for real-time deployment.
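
The URL-based features listed above can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `extract_url_features`, the set of characters treated as "special", and the subdomain-counting rule are assumptions made for illustration and are not taken from the paper.

```python
import ipaddress
from urllib.parse import urlparse

# Characters counted as "special"; this set is an illustrative assumption.
SPECIAL_CHARS = set("@-_%=&?~")


def extract_url_features(url: str) -> dict:
    """Return a small dictionary of structural URL features (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Check whether the host is a raw IP address rather than a domain name.
    try:
        ipaddress.ip_address(host)
        uses_ip = True
    except ValueError:
        uses_ip = False

    # Rough subdomain count: labels beyond "domain.tld".
    # Simplification: multi-part suffixes such as "co.uk" are not handled.
    labels = [part for part in host.split(".") if part]
    subdomains = 0 if uses_ip else max(len(labels) - 2, 0)

    return {
        "url_length": len(url),
        "special_char_count": sum(ch in SPECIAL_CHARS for ch in url),
        "uses_https": parsed.scheme == "https",
        "uses_ip_address": uses_ip,
        "subdomain_count": subdomains,
    }


if __name__ == "__main__":
    print(extract_url_features("https://secure.login.example.com/a?b=1"))
```

Each URL is reduced to a fixed set of numeric or boolean values, which is the form a supervised classifier typically expects as input.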

Literature Review

Previous phishing detection approaches have evolved through the following stages:

Blacklist and heuristic-based methods, which require constant updates and cannot detect new attacks.
Feature-based machine learning methods, using algorithms such as Decision Trees, Support Vector Machines (SVM), and Naïve Bayes.
Ensemble learning techniques, particularly Random Forest, which improve classification performance and reduce overfitting.
NLP-based approaches, which analyze email content, writing styles, and suspicious keywords.
Deep learning models, including CNNs and LSTMs, which capture complex data patterns but often face challenges related to computational complexity and dataset requirements.

System Architecture

The proposed system consists of several modules:

Data Collection Module
- Collects labeled phishing and legitimate samples from public datasets.
Preprocessing Module
- Removes missing values, duplicates, and irrelevant information.
- Standardizes text through normalization techniques.
Feature Extraction Module
- Extracts:
  - URL-based features (length, IP usage, special characters).
  - Content-based features (keywords, text patterns, HTML structure).
  - Network-based features (domain and DNS information).
Classification Module
- Uses supervised machine learning models to classify websites or messages as phishing or legitimate.
Feedback Mechanism
- Continuously updates the model with newly identified phishing samples to improve adaptability.
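
The preprocessing and content-based feature steps in the modules above might look roughly like the following sketch. The keyword list, the helper names (`normalize_text`, `preprocess`, `keyword_features`), and the `(text, label)` sample format are hypothetical; the paper does not specify them.

```python
import re

# Hypothetical keyword list; a real system would likely derive this from data.
SUSPICIOUS_KEYWORDS = ("verify", "urgent", "password", "suspended", "click")


def normalize_text(text: str) -> str:
    """Strip HTML tags (crudely), lowercase, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def preprocess(samples):
    """Drop samples with missing text or label, and duplicates after normalization.

    `samples` is assumed to be an iterable of (text, label) pairs.
    """
    seen = set()
    cleaned = []
    for text, label in samples:
        if not text or label is None:
            continue  # missing value
        norm = normalize_text(text)
        if not norm or norm in seen:
            continue  # empty after cleaning, or duplicate
        seen.add(norm)
        cleaned.append((norm, label))
    return cleaned


def keyword_features(text: str) -> dict:
    """Count occurrences of each suspicious keyword in already-normalized text."""
    words = re.findall(r"[a-z]+", text)
    return {kw: words.count(kw) for kw in SUSPICIOUS_KEYWORDS}


if __name__ == "__main__":
    raw = [("<p>URGENT: verify your  password</p>", 1), ("Meeting at noon", 0)]
    for text, label in preprocess(raw):
        print(label, keyword_features(text))
```

Keyword counts are only one simple form of content feature; the text patterns and HTML structure mentioned above would need additional extractors.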

Methodology

The system follows a structured workflow:

Dataset collection and preprocessing.
Extraction of URL and content-based features.
Splitting data into training and testing sets.
Training and comparing multiple algorithms:
- Random Forest
- Logistic Regression
- Support Vector Machine (SVM)
Evaluating performance using:
- Accuracy
- Precision
- Recall
- F1-score
Deploying the best-performing model for real-time phishing detection and alert generation.
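
For reference, the four evaluation metrics in the workflow above can be computed from true and predicted labels as in the sketch below. The function name `classification_metrics` and the convention that label `1` denotes phishing are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task.

    `positive` is the label treated as "phishing" (assumed to be 1 here).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything flagged as phishing, how much really was.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real phishing samples, how many were caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_metrics(y_true, y_pred))
```

Precision is the metric most directly tied to false positives: each legitimate site wrongly flagged as phishing lowers it.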

Experimental Evaluation

The system was tested on the UCI Phishing Websites Dataset, with 80% of the data used for training and 20% for testing. Random Forest, SVM, and Logistic Regression models were compared to identify the most effective approach for phishing detection.
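
A shuffled 80/20 split of this kind can be sketched as follows. The function name `split_train_test` and the fixed seed are assumptions; the paper does not state whether the split was shuffled, stratified, or seeded.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=42):
    """Shuffle the samples reproducibly and split them into train and test sets."""
    shuffled = list(samples)                # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # fixed seed is an assumption
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    train, test = split_train_test(range(100))
    print(len(train), len(test))
```

Fixing the seed keeps the split identical across runs, so that differences between models reflect the models rather than the partition.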

Conclusion

This paper presented a hybrid machine learning-based phishing detection system that combines URL analysis with content-based feature extraction using NLP techniques. The proposed model addresses the limitations of traditional detection approaches by improving adaptability and detection accuracy. Experimental results demonstrate that the system achieves high performance across multiple evaluation metrics, making it suitable for real-time deployment. Future work may focus on incorporating deep learning techniques and real-time browser integration to further enhance system effectiveness and scalability.

Copyright

Copyright © 2026 Shaik Dharvesh Abbas, Chintha Karthikeya, Pragada Dyva Sagar, Saka Sri Praveen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET80606

Publish Date : 2026-04-20

ISSN : 2321-9653

Publisher Name : IJRASET
