Fake Reviews Detection System Using Machine Learning and Natural Language Processing

Authors: Aryan Gupta, Aayush Singhal, Raj Chauhan, Sakib

DOI Link: https://doi.org/10.22214/ijraset.2026.81157

Abstract

Online reviews are among the most consequential determinants of contemporary consumer purchasing behavior. Empirical studies indicate that approximately 93% of consumers consult online reviews before making purchasing decisions, and that 91% of individuals aged 18 to 34 trust online reviews as much as personal recommendations. The proliferation of fraudulent reviews across e-commerce platforms, travel aggregators, and hospitality services has compromised the integrity of the online review ecosystem, with research estimates suggesting that between 16% and 30% of all online reviews are fabricated or deceptive. This research presents a Fake Reviews Detection System that uses machine learning classification algorithms, specifically Naive Bayes and Support Vector Machine (SVM), together with Natural Language Processing (NLP) preprocessing pipelines and TF-IDF feature extraction to automatically classify online reviews as either truthful or deceptive. The proposed system processes review text through data cleaning, tokenization, stop-word removal, and vectorization stages before training supervised classification models on the Deceptive Opinion Spam Corpus. Experimental evaluation shows that the SVM classifier achieves a classification accuracy of 89.6% while the Naive Bayes classifier attains 86.3%, and the integrated system provides real-time detection through a web-based interface built with the Flask framework.

Introduction

This paper addresses the problem of fake online reviews in e-commerce and their impact on consumer trust and global economic activity. Online reviews strongly influence buying decisions, but fraudulent reviews, both positive and negative, are widespread and increasingly sophisticated, especially with the rise of AI-generated text. Traditional detection methods such as manual moderation and rule-based filtering do not scale and fail to adapt to evolving deception strategies, which makes them ineffective for modern platforms.

To address this, the study proposes a machine learning-based fake review detection system using NLP techniques and TF-IDF feature extraction. The system preprocesses review text (cleaning, tokenization, stop-word removal) and converts it into numerical features. Two classifiers—Naive Bayes and Support Vector Machine (SVM)—are used to detect whether a review is genuine or fake. A Flask web application is developed to provide real-time prediction with confidence scores, making the system accessible for practical use.
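The preprocessing stage described above (cleaning, tokenization, stop-word removal) can be illustrated with a minimal sketch. The paper does not state which tokenizer or stop-word list it uses, so the regular expressions, the tiny `STOP_WORDS` set, and the function name `preprocess` below are assumptions made for illustration only.

```python
import re

# Hypothetical, deliberately small stop-word list; the paper does not
# say which list it uses, and a real system would use a fuller one.
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "is", "was", "were", "it", "i", "to",
    "of", "in", "at", "this", "that", "for", "on", "with",
})


def preprocess(review):
    """Clean, tokenize, and remove stop words from one review string."""
    text = review.lower()                   # normalize case
    text = re.sub(r"<[^>]+>", " ", text)    # cleaning: drop HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)   # cleaning: keep letters and whitespace
    tokens = text.split()                   # tokenization on whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


if __name__ == "__main__":
    print(preprocess("The room was GREAT, and the staff at this hotel were amazing!!!"))
```

The resulting token list is what a vectorization step would consume next.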

The literature survey shows that prior research has explored various approaches such as stylometric analysis, ensemble models, deep learning, and behavioral features, generally achieving good accuracy but often suffering from issues like lack of interpretability, high computational cost, and limited real-time deployment.

In implementation, the system is trained using the Deceptive Opinion Spam Corpus, with balanced truthful and fake reviews. TF-IDF vectorization (including unigrams and bigrams) is used to represent text, and models are trained and tested using a 70–30 split. The backend is built using Flask, enabling real-time inference and user interaction through a simple web interface.
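As a rough illustration of the vectorization and data split described above, the sketch below computes TF-IDF weights over unigrams and bigrams and performs a 70–30 shuffle split, using only the Python standard library. The paper does not name its implementation, so the smoothed IDF formula (the same one scikit-learn's `TfidfVectorizer` uses by default), the L2 normalization, the fixed seed, and all function names here are assumptions rather than the authors' code.

```python
import math
import random
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return unigrams and bigrams (up to n_max) as space-joined strings."""
    out = []
    for n in range(1, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def fit_idf(docs):
    """Smoothed IDF, ln((1 + N) / (1 + df)) + 1, learned from training docs."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(ngrams(doc)))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def transform(doc, idf):
    """Map one tokenized doc to an L2-normalized sparse TF-IDF dict.

    Terms never seen during fit_idf are ignored.
    """
    tf = Counter(t for t in ngrams(doc) if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def train_test_split(items, test_ratio=0.3, seed=42):
    """Shuffle a copy of items and cut it 70-30 (assumed seed, not stratified)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Toy tokenized reviews, invented for illustration only.
    data = [(["great", "hotel", "staff"], "truthful"),
            (["best", "hotel", "ever", "best", "stay"], "deceptive")] * 5
    train, test = train_test_split(data)
    idf = fit_idf([tokens for tokens, _ in train])
    print(len(train), len(test))
    print(transform(test[0][0], idf))
```

Fitting the IDF table on the training portion only, as done here, keeps test reviews from influencing the feature weights.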

Results show that SVM performs better (89.6% accuracy) than Naive Bayes (86.3%), while Naive Bayes is faster in computation. Both models are efficient enough for real-time deployment. User evaluation indicates good satisfaction, especially due to the dual-model comparison feature.
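The dual-model comparison mentioned above can be sketched as a small agreement check between the two classifiers' outputs. The paper does not describe how the interface combines the two predictions, so the `Prediction` structure, the `compare_models` function, and the choice to report the lower of the two confidences when the models agree are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    model: str         # e.g. "naive_bayes" or "svm"
    label: str         # "truthful" or "deceptive"
    confidence: float  # assumed to lie in [0, 1]


def compare_models(nb, svm):
    """Summarize whether the two models agree on a review.

    Assumption: when they agree, report the smaller confidence as a
    conservative joint score; when they disagree, report no label.
    """
    if nb.label == svm.label:
        return {
            "agree": True,
            "label": nb.label,
            "confidence": min(nb.confidence, svm.confidence),
        }
    return {"agree": False, "label": None, "confidence": None}


if __name__ == "__main__":
    # Confidence values below are invented for illustration.
    nb = Prediction("naive_bayes", "deceptive", 0.81)
    svm = Prediction("svm", "deceptive", 0.93)
    print(compare_models(nb, svm))
```

A disagreement result could be surfaced to the user as a prompt for manual review rather than as a verdict.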

Conclusion

This work demonstrates that machine learning classifiers can deliver accurate, accessible fake review detection through automated text analysis, supporting platform content moderation and addressing consumer trust problems in the online review ecosystem. The proposed dual-model architecture, which applies Naive Bayes and SVM classifiers to TF-IDF feature representations, achieves 89.6% classification accuracy and 0.943 AUC-ROC with SVM, showing that automated identification of deceptive reviews is practically feasible. The Naive Bayes classifier provides complementary rapid-inference classification at 86.3% accuracy, suitable for high-throughput preprocessing applications.

The Flask web application turns these results into a practical tool for real-time authenticity assessment, allowing platform moderators and consumers to use automated review analysis without specialized data science expertise. The dual-model presentation lets users assess prediction reliability through model agreement: concordant predictions give greater confidence in the classification outcome. The web-based deployment architecture decouples computational processing from user interaction, so authenticity assessment is available through a browser in a range of content moderation and consumer protection scenarios.

Several enhancements could further strengthen the framework's capabilities and practical usefulness. Extending the classification architecture with deep learning models, including fine-tuned BERT-based encoders and recurrent neural network architectures, would likely improve detection by capturing contextual semantic relationships that bag-of-words feature representations cannot express. Ensemble fusion mechanisms that combine predictions from multiple classifiers through weighted voting or stacking could improve accuracy further while retaining the uncertainty quantification benefits of a multi-model architecture. Furthermore, integrating behavioral feature analysis covering reviewer posting patterns, rating distribution anomalies, and temporal activity characteristics would enable multi-modal detection of sophisticated fake review strategies that content analysis alone cannot identify.
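The weighted voting idea proposed as future work can be sketched as follows. This is not part of the implemented system: the function name `weighted_vote`, the 0.5 decision threshold, the use of each model's reported accuracy as its weight, and the example probabilities are assumptions chosen for illustration.

```python
def weighted_vote(probs, weights, threshold=0.5):
    """Combine per-model probabilities of the "deceptive" class.

    probs:   model name -> estimated probability that the review is deceptive
    weights: model name -> non-negative weight for that model
    Returns (label, weighted average probability).
    """
    total = sum(weights[m] for m in probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(weights[m] * p for m, p in probs.items()) / total
    label = "deceptive" if score >= threshold else "truthful"
    return label, score


if __name__ == "__main__":
    # Assumed weighting: the accuracies reported in the paper.
    weights = {"naive_bayes": 0.863, "svm": 0.896}
    # Invented probabilities for a single review.
    probs = {"naive_bayes": 0.40, "svm": 0.80}
    print(weighted_vote(probs, weights))
```

Stacking would replace the fixed weights with a second-stage model trained on the base classifiers' outputs.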

Copyright

Copyright © 2026 Aryan Gupta, Aayush Singhal, Raj Chauhan, Sakib. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET81157

Publish Date : 2026-04-26

ISSN : 2321-9653

Publisher Name : IJRASET
