A Hybrid Framework for Real-Time Phishing Detection Using URL, Content, and DOM Features with Interpretable ML

Authors: Prof. Divyashree D, Sana Khan, Irshad Ansari

DOI Link: https://doi.org/10.22214/ijraset.2025.70930

Abstract

Phishing attacks are becoming increasingly sophisticated, and detecting them requires systems that combine multiple signals while still operating in real time. This research proposes a scalable hybrid system for real-time phishing detection that draws on three feature families: URL-based attributes, content attributes, and DOM structure attributes. The approach combines explainable machine learning (SHAP) with an ensemble of SVM and deep learning models to deliver high accuracy while supporting security analysts in their decisions. Adaptive learning and dynamic feature selection keep the system robust against evolving phishing techniques. Experiments carried out on multiple datasets show that the proposed approach outperforms traditional single-feature approaches, achieving 98.3% detection accuracy with a false positive rate of 0.7%. The system is designed for integration into web browsers and security gateways, connecting high-performing AI with practical, scalable, real-time defense against today's phishing attacks.

Introduction

I. Problem Overview

Phishing is one of the most costly and dangerous cyber threats, costing businesses over $4.9 billion annually.
Modern phishing attacks are highly sophisticated, using dynamic URLs, cloaked content, and malicious DOM structures, making them hard to detect using traditional systems such as blacklists and static HTML analysis.
There is a critical need for detection systems that can analyze phishing from multiple angles and adapt to evolving tactics.

II. Proposed Solution: HybridPhishNet

HybridPhishNet is a real-time, hybrid phishing detection framework that combines:
- DOM structure analysis
- HTML/content inspection
- URL-based feature extraction
It uses a hybrid machine learning model (CNN-LSTM + SVM) and integrates explainable AI (XAI) through SHAP to aid analyst understanding and decision-making.
It achieves 98.1% detection accuracy, a 0.7% false-positive rate, and latency under 50 ms.

III. Technical Approach

Data Collection

50,000 total websites (25,000 phishing, 25,000 legitimate).
Sources: APWG, PhishTank, OpenPhish, and Tranco Top 10K.
Used Selenium and Puppeteer for full page rendering and DOM/script extraction.

Feature Engineering

URL Features: Lexical properties, WHOIS data, TLS validity.
HTML/Content Features: TF-IDF, visual similarity (Siamese network), obfuscation indicators (e.g., Base64, iframes).
DOM Features: Interaction graphs of event listeners, DOM updates (e.g., JavaScript document.write, dynamic form injections).
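
The feature families above can be illustrated with a minimal sketch using only the Python standard library. The paper does not list its exact feature set, so the feature names, the `url_lexical_features` and `content_indicators` helpers, and the 40-character Base64 threshold are all assumptions chosen for illustration. WHOIS, TLS, TF-IDF, and visual-similarity features are left out because they need network access or trained models.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_lexical_features(url: str) -> dict:
    """Hypothetical lexical URL features (not the paper's exact list)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "host_is_ipv4": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
        "host_entropy": shannon_entropy(host),
    }


class IndicatorParser(HTMLParser):
    """Counts a few static markup indicators while parsing."""

    def __init__(self):
        super().__init__()
        self.iframes = 0
        self.inline_handlers = 0

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes += 1
        # Heuristic: attributes such as onclick/onload start with "on".
        self.inline_handlers += sum(1 for name, _ in attrs if name.startswith("on"))


# Assumed threshold: runs of 40+ Base64-alphabet characters count as suspicious.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")


def content_indicators(html: str) -> dict:
    """Hypothetical obfuscation and DOM-update indicators from raw HTML."""
    parser = IndicatorParser()
    parser.feed(html)
    parser.close()
    return {
        "iframes": parser.iframes,
        "inline_event_handlers": parser.inline_handlers,
        "document_write_calls": html.count("document.write("),
        "base64_like_runs": len(BASE64_RUN.findall(html)),
    }
```

This sketch inspects static markup only. The event-listener interaction graphs described above depend on a rendered page, which is why the paper relies on a headless browser for extraction.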

Model Design

CNN-LSTM for DOM sequence modeling.
SVM (with SHAP explainability) for URL and content analysis.
Dynamic ensemble uses confidence-weighted voting to combine SVM and deep learning outputs.
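
One way to read confidence-weighted voting is sketched below. The paper does not give its weighting formula, so the `confidence` function (distance of a probability from 0.5, rescaled to [0, 1]), the `ModelOutput` type, and the 0.5 decision threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    name: str
    p_phish: float  # predicted probability that the page is phishing, in [0, 1]


def confidence(p: float) -> float:
    """Assumed confidence measure: 0 at p = 0.5, rising to 1 at p = 0 or p = 1."""
    return abs(p - 0.5) * 2.0


def weighted_vote(outputs, threshold=0.5):
    """Combine model probabilities, weighting each by its own confidence.

    Returns (score, is_phishing). If every model is maximally uncertain,
    fall back to a plain average so the division is always defined.
    """
    weights = [confidence(o.p_phish) for o in outputs]
    total = sum(weights)
    if total == 0.0:
        score = sum(o.p_phish for o in outputs) / len(outputs)
    else:
        score = sum(w * o.p_phish for w, o in zip(weights, outputs)) / total
    return score, score >= threshold


if __name__ == "__main__":
    outputs = [ModelOutput("svm", 0.9), ModelOutput("cnn_lstm", 0.6)]
    print(weighted_vote(outputs))
```

In this example the SVM is more confident (weight 0.8 versus 0.2), so the combined score of approximately 0.84 sits closer to its prediction than a plain average of 0.75 would.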

Interpretability

Uses SHAP to identify which features contribute most to phishing predictions (e.g., “domain age < 7 days” increases risk).
Analysts can visualize attention weights and JavaScript behaviors in a dashboard to review alerts.
The explanations improved user trust by 40% compared to black-box models.
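
To make the idea of per-feature contributions concrete, the sketch below computes exact Shapley values for a single prediction by enumerating feature coalitions. It is not the SHAP library and not the paper's implementation: the toy linear scoring function, its weights, and the feature order are hypothetical, and a feature that is absent from a coalition is simply replaced by a single baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for instance x relative to a baseline instance.

    Features missing from a coalition take their baseline value. The cost
    grows as O(2^n) model calls, so this only suits a handful of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(v):
    """Hypothetical risk score over
    [domain_age_under_7_days, dots_in_host, has_iframe]."""
    return 0.1 + 0.5 * v[0] + 0.05 * v[1] + 0.2 * v[2]


if __name__ == "__main__":
    x = [1, 4, 0]         # young domain, four dots, no iframe
    baseline = [0, 2, 0]  # assumed "typical" legitimate page
    print(shapley_values(toy_score, x, baseline))
```

The contributions sum to the difference between the score for `x` and the score for the baseline, which is the additivity property that lets an analyst read each value as that feature's share of the risk score.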

IV. Literature and Research Gaps

Prior systems:
- Struggled with dynamic content and DOM manipulation.
- Lacked real-time processing and cross-feature fusion.
- Provided little or no explainability.
HybridPhishNet addresses these gaps through a multi-modal detection system and interpretable ensemble models.

V. Experimental Results

Model	Accuracy	False Positive Rate (FPR)	Latency	Interpretable?
Baseline [8]	92%	2.1%	20 ms	No
Prior Art [4]	94.2%	1.5%	200 ms	Partial
HybridPhishNet	98.1%	0.7%	<50 ms	Yes
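
For reference, the accuracy and FPR columns follow the standard confusion-matrix definitions, sketched below. The counts in the example are invented purely to show the arithmetic and are not taken from the paper's experiments.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all pages classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples")
    return (tp + tn) / total


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of legitimate pages wrongly flagged as phishing."""
    negatives = fp + tn
    if negatives == 0:
        raise ValueError("no legitimate samples")
    return fp / negatives


if __name__ == "__main__":
    # Hypothetical counts: 1,000 phishing pages and 1,000 legitimate pages.
    tp, fn = 970, 30
    tn, fp = 990, 10
    print(accuracy(tp, tn, fp, fn))     # 0.98
    print(false_positive_rate(fp, tn))  # 0.01
```

Because FPR is computed over legitimate pages only, a low FPR can coexist with missed phishing pages, so the two columns are best read together.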

Detects 27% more advanced phishing attacks (e.g., fake overlays).
SHAP dashboard lets analysts verify 95% of alerts.
2.4× faster incident response compared to previous systems.

VI. Limitations & Future Work

Limitations:

Manual intervention may be needed for new JavaScript obfuscation techniques.
Slight language/geographic bias in dataset (mostly English/Latin characters).
Not yet tested against emerging threats like WebAssembly-based phishing.

Future Enhancements:

Use reinforcement learning for adaptive feature updates.
Expand to multilingual datasets for better global coverage.
Implement edge computing (e.g., Raspberry Pi) for low-power, IoT-compatible real-time detection.

Conclusion

HybridPhishNet offers a scalable and effective defense against current phishing threats by combining high detection accuracy, real-time operation, and actionable interpretability. It builds on previous research, and its modular hybrid architecture allows it to be deployed across a range of enterprise contexts. Its current limitations are handling non-English phishing content and adapting to new attack vectors. Future work will focus on reinforcement learning, larger multilingual datasets, and deployment to edge devices. HybridPhishNet marks a step toward robust, advanced phishing defense.

References

[1] Ramirez-Thompson, Eric. "The Measurement of Crime." Criminology: Foundations and Modern Applications (2023).
[2] Sundaram, J. and CISA, I., Analyzing and Adapting Cybersecurity Lessons: Safeguarding Organizations Through Strategic Alignment and Continuous Improvement.
[3] Sahingoz, O.K., Buber, E., Demir, O. and Diri, B., 2019. Machine learning based phishing detection from URLs. Expert Systems with Applications, 117, pp.345-357.
[4] Bhatia, A. and Kumar, A., 2025. AI Explainability and Trust in Cybersecurity Operations. In Deep Learning Innovations for Securing Critical Infrastructures (pp. 57-74). IGI Global Scientific Publishing.
[5] Pourmohamad, R., Wirsz, S., Oest, A., Bao, T., Shoshitaishvili, Y., Wang, R., Doupé, A. and Bazzi, R.A., 2024, July. Deep Dive into Client-Side Anti-Phishing: A Longitudinal Study Bridging Academia and Industry. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security (pp. 638-653).
[6] Prakash, S., Rama Krishna, K. and Verma, I., 2024. Security Issues with Social Media Data. Indradeep, Security Issues with Social Media Data (July 03, 2024).
[7] Sharma, I. and Sharma, A.K., 2023. Anti-phishing tools: A thorough comparison of features and performance. International Journal for Research in Applied Science and Engineering Technology, 11, pp.478-482.
[8] Abdulraheem, R., Odeh, A., Al Fayoumi, M. and Keshta, I., 2022, January. Efficient Email phishing detection using Machine learning. In 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC) (pp. 0354-0358). IEEE.
[9] Atlam, H.F. and Oluwatimilehin, O., 2022. Business email compromise phishing detection based on machine learning: A systematic literature review. Electronics, 12(1), p.42.
[10] Li, Q., Cheng, M., Wang, J. and Sun, B., 2020. LSTM based phishing detection for big email data. IEEE Transactions on Big Data, 8(1), pp.278-288.
[11] Bergholz, A., Chang, J.H., Paass, G., Reichartz, F. and Strobel, S., 2008, August. Improved Phishing Detection using Model-Based Features. In CEAS.
[12] Salloum, S., Gaber, T., Vadera, S. and Shaalan, K., 2022. A systematic literature review on phishing email detection using natural language processing techniques. IEEE Access, 10, pp.65703-65727.
[13] Thakur, K., Ali, M.L., Obaidat, M.A. and Kamruzzaman, A., 2023. A systematic review on deep-learning-based phishing email detection. Electronics, 12(21), p.4545.
[14] Şentürk, Ş., Yerli, E. and Soğukpınar, İ., 2017, October. Email phishing detection and prevention by using data mining techniques. In 2017 International Conference on Computer Science and Engineering (UBMK) (pp. 707-712). IEEE.
[15] Moizuddin, M.K., Kabeer, M. and Misbahuddin, M., 2024, October. Cyber-Phishing Analysis offering Cyber Security for Social Networks. In 2024 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS) (pp. 1-5). IEEE.

Copyright

Copyright © 2025 Prof. Divyashree D, Sana Khan, Irshad Ansari. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET70930

Publish Date : 2025-05-13

ISSN : 2321-9653

Publisher Name : IJRASET
