AI Based Web Approach System for Detecting Malicious URLs and Preventing Cyber Fraud

Authors: Prof. Chethana R. M., Shaheeda Kasim Pinjar, Sneha, Tisha Prakash Kavri, Yashaswini S

DOI Link: https://doi.org/10.22214/ijraset.2025.69772

Abstract

As the internet becomes more essential in daily life, cyber fraud, especially through harmful links, has become a serious issue. This project introduces a web-based system that detects and classifies dangerous URLs in real time as phishing, malware, or defacement links. It uses machine learning models trained on a mix of safe and harmful URLs, analyzing features such as link structure, special characters, and keywords. Built with Flask, the system provides a simple interface where users can check URLs. A feedback option lets users confirm or correct results, so the system can learn and improve over time. The system can also scrape live webpage content and display the main text in a clean, readable format to help users understand what the page is about. A whitelist of trusted domains avoids unnecessary checks on safe websites. The design is lightweight, fast, and easy to extend with features such as scanning multiple links, analyzing content tone, or auto-flagging suspicious content. Overall, the system offers a user-friendly and effective way to fight cyber threats using AI and real-time analysis.

Introduction

1. Background and Motivation

Cyber fraud, particularly through malicious URLs, is a growing threat affecting individuals, businesses, and governments. Traditional detection methods such as blacklists are slow to update, quickly become outdated, and are ineffective against new threats like phishing, malware, redirection, and shortened links.

2. Problem Statement

Attackers disguise harmful URLs to evade detection.
Traditional systems lack real-time capabilities and adaptability.
All sectors (finance, healthcare, education, e-commerce) are vulnerable.
A smart, AI-driven solution is needed for real-time, accurate URL classification.

3. Objectives of the Proposed System

Detect suspicious URLs using structural features (length, special characters, keywords).
Categorize threats into phishing, malware, and defacement.
Use web scraping to analyze actual webpage content.
Offer readable content extraction to help users assess the safety of a site.
Enable user feedback to improve system accuracy.
Provide a Flask-based user interface for ease of use.

4. Literature Review & Limitations of Existing Systems

Blacklist-based systems are fast at lookup but quickly become outdated.
Heuristic systems detect known patterns but are vulnerable to evasion.
Traditional machine learning models lack adaptability to new attack patterns.
Web scraping systems may not provide clean content or real-time protection.
Most current systems lack dynamic learning and effective user feedback loops.

5. Proposed System Overview

The proposed system is a real-time, AI-powered web app built with Flask. It uses a Random Forest Classifier to analyze URLs and classify them into four categories:

Benign
Phishing
Malware
Defacement

Key Features:

Machine Learning based on URL features (length, symbols, digits, keywords, IP usage, HTTPS presence).
User Feedback Mechanism to correct classifications and retrain the model.
Live Webpage Scraping to analyze real content.
Content Extraction using readability-lxml for clean, readable display.
Trusted Domain Whitelist to reduce false alarms.
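
The URL features listed above (length, symbols, digits, keywords, IP usage, HTTPS presence) can be sketched as a small feature extractor. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names, the feature names, and the keyword list are assumptions, since the paper does not publish them.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Hypothetical keyword list; the paper does not publish the one it uses.
SUSPICIOUS_KEYWORDS = ("login", "verify", "update", "secure", "account", "bank")


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Map a URL string to a flat dict of numeric structural features."""
    # urlsplit only recognises the host when a scheme or '//' is present.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special": len(re.findall(r"[^A-Za-z0-9]", url)),
        "num_keywords": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
        "uses_ip": int(host_is_ip(host)),
        "uses_https": int(parts.scheme == "https"),
    }
```

Each URL becomes one row of numbers, which is the form a tree-based classifier such as Random Forest expects. Calling extract_features("http://192.168.0.1/login.php?id=7") would flag IP usage, the absence of HTTPS, and one suspicious keyword.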

6. System Architecture

Data Collection: Uses Selenium and scraping tools to gather live data.
Feature Extraction: Structural and behavioral features from URLs and WHOIS data.
Model Training: Random Forest for its robustness and interpretability.
Real-Time Prediction: Deployed via Flask web interface.
Feedback Loop: User inputs stored for periodic model improvement.
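
The real-time prediction step can be sketched as a single function that consults the whitelist first and only then calls the trained model. This is an illustrative outline under stated assumptions: the whitelist entries are placeholders, returning "Benign" for a whitelisted host is an assumed policy, and `model` stands for any object with a scikit-learn style predict method rather than the authors' actual classifier.

```python
from urllib.parse import urlsplit

# Hypothetical whitelist entries, for illustration only.
TRUSTED_DOMAINS = {"example.org", "python.org"}


def is_trusted(url: str, trusted=TRUSTED_DOMAINS) -> bool:
    """True when the URL's host is a trusted domain or one of its subdomains."""
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").lower()
    # Match on a leading dot so that "evilpython.org" is not accepted.
    return any(host == d or host.endswith("." + d) for d in trusted)


def classify_url(url: str, model, featurize, labels) -> str:
    """Return a category name, skipping the model for whitelisted hosts.

    model     -- any object exposing predict(rows) (assumed interface)
    featurize -- callable turning a URL into one feature row
    labels    -- sequence mapping the predicted integer to a category name
    """
    if is_trusted(url):
        return "Benign"  # assumed policy for whitelisted domains
    row = featurize(url)
    predicted = model.predict([row])[0]
    return labels[int(predicted)]
```

In a Flask deployment, a route handler would read the submitted URL from the form, call a function like this, and pass the result to a template. Matching on the parsed host rather than on a substring of the URL keeps a link such as http://python.org.evil.test/ from being treated as trusted.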

7. Dataset and Preprocessing

Dataset: Labeled URLs from Kaggle (Benign, Phishing, Malware, Defacement).
Preprocessing:
- Cleaning malformed URLs.
- Feature engineering for patterns (dots, digits, symbols, suspicious words).
- Label encoding for model compatibility.
- Train-test split (80/20).
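
The preprocessing steps above can be sketched in plain Python. This is a simplified stand-in, not the authors' pipeline: the definition of a malformed URL, the function names, and the fixed seed are assumptions, and in practice library helpers such as scikit-learn's LabelEncoder and train_test_split are commonly used for the last two steps.

```python
import random
from urllib.parse import urlsplit


def is_well_formed(url: str) -> bool:
    """Keep non-empty, whitespace-free strings that parse to a host (assumed rule)."""
    url = url.strip()
    if not url or any(ch.isspace() for ch in url):
        return False
    try:
        host = urlsplit(url if "//" in url else "//" + url).hostname
    except ValueError:  # e.g. an unbalanced '[' in the host
        return False
    return bool(host)


def encode_labels(labels):
    """Map category names to integers in sorted order; return codes and mapping."""
    mapping = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping


def split_80_20(rows, seed=0):
    """Shuffle a copy of rows and cut it 80/20 into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]
```

Filtering with is_well_formed before encoding keeps unparseable rows out of both partitions, and the fixed seed makes the split reproducible between training runs.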

8. Technology Stack

Backend: Flask, Python, Random Forest, CSV for feedback.
Frontend: HTML, CSS, Jinja2 templates.
Libraries: BeautifulSoup, readability-lxml for scraping and content extraction.
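
Storing feedback in a CSV file, as the backend entry above mentions, could look roughly like the following. The column names, the function names, and the idea of filtering for disagreements before retraining are assumptions made for illustration; the paper does not specify the file layout.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout for the feedback file.
FIELDS = ["timestamp", "url", "predicted", "corrected"]


def record_feedback(path, url, predicted, corrected):
    """Append one feedback row, writing the header if the file is new or empty."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "url": url,
            "predicted": predicted,
            "corrected": corrected,
        })


def load_corrections(path):
    """Return (url, corrected_label) pairs where the user disagreed with the model."""
    with Path(path).open(newline="", encoding="utf-8") as fh:
        return [(row["url"], row["corrected"])
                for row in csv.DictReader(fh)
                if row["corrected"] != row["predicted"]]
```

Rows where the user's label differs from the prediction are the ones most useful to add to the training set during a periodic retraining run.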

Conclusion

This project presents a user-friendly AI-based web system that helps detect and prevent cyber fraud by analyzing URLs in real time. Using machine learning, web scraping, and content extraction, the system classifies URLs as safe or malicious (phishing, malware, or defacement) and shows users clear, readable webpage content to help them make better decisions. A feedback feature lets users improve the system's accuracy over time. Overall, it is a scalable and interactive solution that enhances online safety and lays the groundwork for future upgrades such as multilingual support and integration with other security tools.

Copyright

Copyright © 2025 Prof. Chethana R. M., Shaheeda Kasim Pinjar, Sneha, Tisha Prakash Kavri, Yashaswini S. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET69772

Publish Date : 2025-04-26

ISSN : 2321-9653

Publisher Name : IJRASET
