Spam Email Filter with NLP: A Machine Learning Approach

Authors: G. Prudhvi, Dr. U. Sathish, B. Giridhara Reddy, G. Vignesh, D. Vijay

DOI Link: https://doi.org/10.22214/ijraset.2026.81866

Abstract

With the rapid expansion of digital communication, spam emails containing phishing links, fraudulent offers, or malware have become a major security concern for individuals and organizations. The proposed project, Spam Email Filter using NLP, utilizes Natural Language Processing (NLP) and Machine Learning (ML) to automatically identify and filter such malicious emails. The system preprocesses email text through several NLP stages, including tokenization, stopword removal, normalization, and feature extraction using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. These features are then used to train a Naive Bayes classifier, which distinguishes between spam and legitimate (ham) emails based on learned probability distributions. Performance is evaluated using metrics such as accuracy, precision, recall, and F1-score, demonstrating high accuracy and minimal false positives. The model is computationally efficient, adaptable for real-time spam detection, and enhances email security by filtering harmful content. Future improvements may include multilingual support and adaptive learning to handle evolving spam patterns.
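
The stages named in the abstract (normalization, tokenization, stopword removal, TF-IDF weighting, and Naive Bayes classification) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword list, the four training emails, the smoothed IDF formula, and every function name are hypothetical choices made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "to", "for", "and", "of", "is", "in", "at", "you", "your", "on"}

def preprocess(text):
    """Normalize to lowercase, tokenize on runs of letters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc, idf):
    """TF-IDF weights for one document; terms unseen in training are ignored."""
    counts = Counter(t for t in doc if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}

def train_nb(vectors, labels, vocab):
    """Multinomial Naive Bayes over TF-IDF weights with add-one smoothing."""
    class_counts = Counter(labels)
    weight_sums = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        weight_sums[label].update(vec)  # accumulate per-class term weights
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(weight_sums[label].values()) + len(vocab)
        log_likelihood = {t: math.log((weight_sums[label][t] + 1) / total) for t in vocab}
        model[label] = (math.log(n_docs / len(labels)), log_likelihood)
    return model

def predict(model, vec):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(w * log_likelihood[t] for t, w in vec.items())
    return max(model, key=score)

# Hypothetical toy training set; "ham" means legitimate mail.
emails = [
    ("Win a free prize now, click the link", "spam"),
    ("Free offer: claim your cash prize today", "spam"),
    ("Meeting agenda for the project review", "ham"),
    ("Please review the attached project report", "ham"),
]
docs = [preprocess(text) for text, _ in emails]
labels = [label for _, label in emails]
idf = fit_idf(docs)
vectors = [tfidf(doc, idf) for doc in docs]
model = train_nb(vectors, labels, vocab=set(idf))

def classify(text):
    """Run the full pipeline on one raw email string."""
    return predict(model, tfidf(preprocess(text), idf))

print(classify("Claim your free prize"))
print(classify("Project review meeting"))
```

With only four training emails the result is purely illustrative. The point of the sketch is the order of the stages and the way the IDF table learned from the training set is reused when scoring a new email.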

Introduction

This paper presents “SpamGuard AI,” an intelligent real-time email security system that detects spam and phishing emails using machine learning. With the increasing volume of malicious emails worldwide, traditional rule-based filters are no longer effective against modern attacks that use obfuscation and social engineering. To address this, the proposed system extracts TF-IDF features from email content, a standard Natural Language Processing technique, and uses a Random Forest classifier to label each message as legitimate, spam, or phishing with high accuracy.
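
For a three-class problem like the one described above, accuracy alone can hide weak performance on a single class, so precision, recall, and F1 are usually reported per class. The sketch below computes them one-vs-rest in plain Python. The label strings and the six example predictions are invented for illustration and are not results from the paper.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return overall accuracy plus one-vs-rest precision, recall, F1 per label."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # correctly flagged
        fp = sum(t != label and p == label for t, p in pairs)  # wrongly flagged
        fn = sum(t == label and p != label for t, p in pairs)  # missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, report

# Hypothetical labels and predictions, for illustration only.
LABELS = ["legitimate", "spam", "phishing"]
y_true = ["legitimate", "legitimate", "spam", "spam", "phishing", "phishing"]
y_pred = ["legitimate", "spam", "spam", "spam", "phishing", "legitimate"]

accuracy, report = per_class_metrics(y_true, y_pred, LABELS)
print(f"accuracy: {accuracy:.3f}")
for label in LABELS:
    print(label, {name: round(value, 3) for name, value in report[label].items()})
```

In this toy data the recall of the legitimate class drops because one legitimate email is flagged as spam, which is the false-positive case an email filter most needs to keep rare.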

The system integrates directly with Gmail using the Gmail API and provides real-time email scanning through a secure architecture built with FastAPI (backend) and Flutter (frontend). Google OAuth 2.0 ensures secure, passwordless authentication. A user-friendly dashboard displays threat levels using visual indicators and analytics.
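
Before an email can be scored, its text has to be pulled out of the message the Gmail API returns, where body data arrives as URL-safe base64 inside a possibly nested MIME payload. The sketch below shows one way that extraction step might look. It assumes a simplified message dictionary built by hand, makes no network or OAuth calls, and uses hypothetical function names; it is not the paper's backend code.

```python
import base64

def decode_body(data):
    """Decode the URL-safe base64 used for message body data."""
    padded = data + "=" * (-len(data) % 4)  # restore padding if it was stripped
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")

def extract_plain_text(payload):
    """Walk a (possibly nested) MIME payload and join its text/plain parts."""
    if payload.get("mimeType") == "text/plain":
        data = payload.get("body", {}).get("data")
        return decode_body(data) if data else ""
    parts = payload.get("parts", [])
    return "\n".join(text for text in map(extract_plain_text, parts) if text)

def encode(raw):
    """Helper for the example only: build body data the way the sample expects."""
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

# Hand-built stand-in for one fetched message (simplified, hypothetical values).
sample_message = {
    "id": "hypothetical-id",
    "payload": {
        "mimeType": "multipart/alternative",
        "parts": [
            {"mimeType": "text/plain", "body": {"data": encode(b"Verify your account now")}},
            {"mimeType": "text/html", "body": {"data": encode(b"<p>Verify your account now</p>")}},
        ],
    },
}

email_text = extract_plain_text(sample_message["payload"])
print(email_text)
```

Only the text/plain part is kept here; a fuller version would likely also strip markup from HTML-only messages before classification.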

The literature review shows that while earlier methods like Naive Bayes and SVM achieved moderate success, newer approaches like Random Forest and deep learning offer higher accuracy but often lack real-world integration. Existing systems also struggle with usability, privacy concerns, and real-time deployment.

The proposed system addresses these gaps by combining machine learning with a complete production-ready platform that includes real-time scanning, secure authentication, audit logging, and cross-platform visualization. The model is trained on a spam dataset using TF-IDF features and optimized Random Forest classification.

Conclusion

This paper presented SpamGuard AI, a comprehensive real-time email threat detection system that addresses the limitations of traditional rule-based spam filters through modern machine learning techniques. The proposed system integrates a Random Forest classifier with TF-IDF feature extraction to achieve 97.8% classification accuracy on benchmark datasets, representing a substantial improvement over conventional filtering approaches. The system architecture successfully combines several modern technologies into a cohesive platform: Flutter enables cross-platform frontend deployment, FastAPI provides high-performance asynchronous backend services, Google OAuth 2.0 ensures secure authentication, and the Gmail API enables seamless integration with the world's most popular email service. Experimental evaluation confirms that the system meets its design objectives for accuracy, response time, and usability. The visual threat dashboard transforms abstract security data into actionable intelligence through color-coded indicators, confidence scoring, and statistical analytics. By making advanced email security accessible to non-technical users, SpamGuard AI contributes to broader cybersecurity awareness and protection. Future work will focus on expanding email provider support, incorporating multilingual NLP capabilities, experimenting with deep learning architectures, and implementing continuous learning mechanisms. The foundation established by this research provides a solid platform for these enhancements and demonstrates the practical viability of machine learning-based email security in production environments.
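
The color-coded indicators and confidence scoring mentioned above could be driven by a small mapping like the one sketched here. The color names, the 0.6 review threshold, and the function names are all hypothetical; the source does not specify how the dashboard assigns threat levels.

```python
from collections import Counter

# Hypothetical cut-off: below this confidence a flagged email is shown for review.
REVIEW_THRESHOLD = 0.6

def threat_color(label, confidence):
    """Map a predicted label and its confidence (0.0 to 1.0) to a dashboard color."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if label == "legitimate":
        return "green"
    if confidence < REVIEW_THRESHOLD:
        return "yellow"  # flagged, but the model is unsure
    return "red" if label == "phishing" else "orange"

def summarize(scans):
    """Count scanned emails per color for a simple analytics panel."""
    return Counter(threat_color(label, confidence) for label, confidence in scans)

# Hypothetical scan results as (predicted label, confidence) pairs.
scans = [("legitimate", 0.98), ("spam", 0.91), ("phishing", 0.55), ("phishing", 0.97)]
summary = summarize(scans)
print(dict(summary))
```

Keeping the threshold in one named constant makes it easy to tune how many low-confidence detections are surfaced for manual review.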

Copyright

Copyright © 2026 G. Prudhvi, Dr. U. Sathish, B. Giridhara Reddy, G. Vignesh, D. Vijay. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET81866

Publish Date : 2026-05-03

ISSN : 2321-9653

Publisher Name : IJRASET
