In today's business landscape, online reviews play a crucial role in shaping commerce: a large share of purchase decisions for online products is driven by customer feedback. Consequently, some individuals and groups try to manipulate product reviews to their advantage. Fake online reviews considerably affect the experiences of online consumers, sellers, and e-commerce platforms. Although academic research has addressed fake-review identification, studies that thoroughly examine and summarize their origins and impacts are still needed. This work introduces a combination of semi-supervised and supervised text-mining models to detect fake reviews, comparing their performance on review datasets.
Introduction
Background
Online reviews significantly influence consumer decisions. However, distinguishing between genuine and fake reviews is difficult. Many fake reviews are written for promotional or deceptive purposes, leading to biased or misleading product perceptions. Due to the volume and complexity of data, manual detection is impractical, making automated detection critical.
Challenges
Fake reviews may be posted by non-buyers or paid individuals.
They often differ linguistically from genuine reviews.
Fake reviews tend to use more verbs, adverbs, and pronouns, whereas genuine reviews rely more on descriptive and sensory language.
Reviewer behavior (e.g., review frequency, review content) can signal suspicious activity.
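To make the linguistic signals above concrete, the toy function below computes the fraction of personal pronouns in a review. The pronoun list and the idea of using this ratio as a standalone feature are illustrative assumptions for this sketch, not the detection method itself.

```python
# Illustrative only: a toy linguistic-signal feature inspired by the
# observation that fake reviews tend to overuse pronouns. The word list
# and the feature's usefulness on its own are simplifying assumptions.
import re

PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your",
            "he", "she", "it", "they"}

def pronoun_ratio(review: str) -> float:
    """Fraction of tokens that are personal pronouns."""
    tokens = re.findall(r"[a-z']+", review.lower())
    if not tokens:
        return 0.0
    return sum(t in PRONOUNS for t in tokens) / len(tokens)

print(pronoun_ratio("I love it, you will love it too"))        # 0.5
print(pronoun_ratio("Crisp picture, warm colors, rich bass"))  # 0.0
```

In practice such hand-crafted signals would be concatenated with the TF-IDF features rather than used alone.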
Related Work
Ott et al. (2011): Used SVM and Naive Bayes on hotel reviews, relying on linguistic patterns.
Mukherjee et al. (2013): Combined text and metadata (e.g., review burstiness).
Recent studies: Apply deep learning models like LSTM to better capture language context.
Proposed System
This project focuses on detecting fake movie reviews using a combination of text processing and machine learning techniques.
Workflow:
Data Collection: From IMDb/Kaggle, labeled as real or fake.
Preprocessing: Remove stopwords, normalize text, and lemmatize.
Feature Extraction: Use TF-IDF, sentiment scores, and review length.
Model Training: Use classifiers like Logistic Regression, SVM, and Random Forest.
Evaluation: Metrics used include accuracy, precision, recall, F1-score, and confusion matrix.
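The workflow above can be sketched end to end with scikit-learn. The in-line eight-review dataset and its labels are stand-ins for the labeled IMDb/Kaggle data, and Logistic Regression stands in for any of the listed classifiers.

```python
# Minimal sketch of the workflow: preprocess -> TF-IDF -> classifier ->
# evaluate. The tiny dataset and labels are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

reviews = [
    "Great movie, wonderful acting and a touching story",
    "Best film ever buy the dvd now amazing amazing",
    "The plot dragged but the cinematography was beautiful",
    "Five stars perfect must watch click here for more",
    "Solid performances, though the ending felt rushed",
    "Incredible deal watch now best best best movie",
    "A quiet, moving drama with believable characters",
    "Unbelievable masterpiece everyone must buy tickets today",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]  # 0 = real, 1 = fake (toy labels)

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=42
)

# TfidfVectorizer handles tokenization, lowercasing, and English stopword
# removal; lemmatization (e.g. via NLTK or spaCy) would slot in as a
# custom preprocessor.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, zero_division=0))
```

Swapping `LogisticRegression` for `LinearSVC` or `RandomForestClassifier` changes only the `"clf"` step; the rest of the pipeline is unchanged.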
Technologies Used
Front-End: Streamlit (for user interaction)
Back-End: Python, FastAPI, SQLite
ML Libraries: scikit-learn, joblib, pandas, numpy
Development Tools: Visual Studio Code, Google Colab (for model training)
System Architecture
Input reviews are preprocessed.
Features are extracted from the cleaned text.
Classifiers (e.g., Linear SVM) predict if a review is real or fake.
Users interact with the system via a web interface.
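This architecture separates training from serving. One common way to hand a trained pipeline to the web front-end is to persist it with joblib at training time and load it at app startup; the file name below is an illustrative assumption.

```python
# Sketch of persisting a fitted text-classification pipeline so the web
# front-end can load it at startup. The dataset and file name are
# illustrative assumptions, not the project's actual artifacts.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LinearSVC()),
])
pipeline.fit(
    ["great heartfelt film", "buy now best best movie",
     "slow but rewarding drama", "amazing click here now"],
    [0, 1, 0, 1],  # 0 = real, 1 = fake (toy labels)
)

# Training side: write the whole pipeline (vectorizer + classifier) to disk.
joblib.dump(pipeline, "fake_review_model.joblib")

# Serving side (e.g. inside the Streamlit app): load once, predict on demand.
model = joblib.load("fake_review_model.joblib")
print(model.predict(["best best movie buy now"]))
```

Persisting the full pipeline, rather than the classifier alone, guarantees the front-end applies exactly the same vectorizer vocabulary that the model was trained with.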
Results
The model was deployed successfully, and users can test it through a simple UI.
Linear SVM (SVC with linear kernel) achieved the highest accuracy (~88%) during evaluation.
The final deployed model uses LinearSVC, which is faster and easier to integrate.
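The two estimators mentioned above learn the same kind of linear decision boundary but differ in implementation: SVC(kernel="linear") is backed by libsvm, while LinearSVC uses liblinear and typically trains much faster on high-dimensional TF-IDF features. A hedged sketch on synthetic data:

```python
# Sketch comparing the two linear SVM implementations. The synthetic
# dataset is an illustrative assumption; both models expose the same
# fit/predict/score interface, so swapping them is transparent to the
# rest of the pipeline.
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

kernel_svm = SVC(kernel="linear").fit(X, y)     # libsvm backend
linear_svm = LinearSVC(max_iter=5000).fit(X, y)  # liblinear backend

print(kernel_svm.score(X, y), linear_svm.score(X, y))
```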
Conclusion
In this paper, we propose a model for detecting fake reviews using machine learning algorithms, specifically Support Vector Machines (SVM). Our model demonstrates a high level of accuracy in identifying fraudulent reviews. Fake review detection remains an emerging area of research, particularly because open datasets are scarce. Through this project, our goal is not only to achieve high accuracy but also to minimize the time required to identify fake reviews. Additionally, the model is designed to handle multiple reviews at once, making it a practical solution for real-world applications.
References
[1] C. Sun, Q. Du, and G. Tian, “Exploiting Product Related Review Features for Fake Review Detection,” Mathematical Problems in Engineering, 2016.
[2] A. Heydari, M. A. Tavakoli, N. Salim, and Z. Heydari, “Detection of review spam: a survey,” Expert Systems with Applications, vol. 42, no. 7, pp. 3634–3642, 2015.
[3] M. Ott, Y. Choi, C. Cardie, and J. T. Hancock, “Finding deceptive opinion spam by any stretch of the imagination,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), vol. 1, pp. 309–319, Association for Computational Linguistics, Portland, Ore, USA, June 2011.
[4] J. W. Pennebaker, M. E. Francis, and R. J. Booth, “Linguistic Inquiry and Word Count: LIWC,” vol. 71, 2001.
[5] S. Feng, R. Banerjee, and Y. Choi, “Syntactic stylometry for deception detection,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, vol. 2, 2012.
[6] J. Li, M. Ott, C. Cardie, and E. Hovy, “Towards a general rule for identifying deceptive opinion spam,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014.