With the rapid expansion of digital communication and social media, the spread of fake news has become a growing concern. Detecting and filtering out fake news is crucial, yet it remains challenging due to the limited availability of suitable datasets and effective analysis techniques. This study presents a machine learning-based approach to detecting fake news. The system extracts textual features using Term Frequency-Inverse Document Frequency (TF-IDF) with bag-of-words and n-grams. A Support Vector Machine (SVM) classifier is then employed to differentiate between authentic and fake news. Additionally, a dataset containing both real and fake news articles is introduced for training the model. The results highlight the effectiveness of the proposed system in accurately identifying misinformation.
Introduction
Problem Context:
The rapid spread of fake news, fueled by social media, poses significant risks by manipulating public opinion, harming reputations, and spreading misinformation—especially on health-related topics like COVID-19. The WHO has even warned about an "infodemic" where the overload of both true and false information confuses the public.
Proposed Solution: Machine Learning Model for Fake News Detection
Key Steps in the Approach:
Text Preprocessing: Clean the data by removing stop words, punctuation, and special characters.
Text Representation: Use Bag-of-Words, N-Grams, and TF-IDF to convert text to numerical form.
Feature Extraction: Analyze metadata like source, author, date, and sentiment.
Classification: Use a Support Vector Machine (SVM) model to classify news as real or fake, assigning a confidence score instead of just a binary label.
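The four steps above can be sketched as a single text-classification pipeline. The snippet below is a minimal illustration using scikit-learn, which the paper does not name as its implementation library; the toy corpus and parameter values are placeholders, not the paper's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for the real dataset (1 = real, 0 = fake).
texts = [
    "officials confirm the report after an independent review",
    "shocking secret cure they do not want you to know",
    "reuters reports the senate passed the budget bill",
    "miracle trick eliminates all disease overnight",
]
labels = [1, 0, 1, 0]

# Preprocessing + representation: stop-word removal, bag-of-words with
# unigrams and bigrams, weighted by TF-IDF.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LinearSVC(C=1.0),
)
pipeline.fit(texts, labels)

# decision_function yields a signed margin usable as a confidence score:
# positive leans "real", negative leans "fake".
scores = pipeline.decision_function(["senate confirms the report"])
```

The signed margin is what makes the confidence-score output possible: instead of only returning a binary label, the system can report how far an article lies from the decision boundary.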
Related Work Overview:
Past studies used classifiers such as Naive Bayes and linear SVM (LSVM), with varying accuracies.
TF-IDF combined with LSVM yielded 92% accuracy, but LSVM struggles with complex, non-linear data.
Some methods included multimedia or social metadata but often ignored metadata like the author or source.
Researchers highlighted the need for confidence-based classification rather than binary labels.
System Design & Implementation:
A. Preprocessing:
Textual Data: Cleaned, stemmed, and numerically encoded.
Categorical Data: Sources and authors encoded for better pattern recognition.
Numerical Data: Date split into components; sentiment analysis performed.
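A minimal sketch of the metadata preprocessing described above, assuming a simple article record; the field names, integer code tables, and tiny sentiment lexicon are all illustrative placeholders (a real system might use an established sentiment tool such as VADER):

```python
from datetime import datetime

# Hypothetical raw metadata for one article; field names are illustrative.
article = {"source": "example-news.com", "author": "J. Doe", "date": "2020-03-15"}

# Categorical data: map each distinct source/author string to an integer code.
source_codes = {"example-news.com": 0, "another-site.org": 1}
author_codes = {"J. Doe": 0, "A. Smith": 1}

# Numerical data: split the date into year / month / day components.
d = datetime.strptime(article["date"], "%Y-%m-%d")
features = {
    "source": source_codes[article["source"]],
    "author": author_codes[article["author"]],
    "year": d.year,
    "month": d.month,
    "day": d.day,
}

# Toy lexicon-based sentiment score: positive hits minus negative hits.
POSITIVE, NEGATIVE = {"confirm", "official"}, {"shocking", "hoax"}
tokens = "officials confirm the shocking report".split()
features["sentiment"] = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
```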
B. Model Training & Validation:
SVM model trained and validated using cross-validation.
Classification based on a confidence score (positive = real, negative = fake).
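Training with cross-validation and reading the signed margin as a confidence score could look like the following scikit-learn sketch (the library choice, the toy eight-document corpus, and the 2-fold split are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny stand-in corpus; 1 = real, 0 = fake.
texts = [
    "officials confirm the new policy after review",
    "agency publishes audited figures for the quarter",
    "court releases the full ruling to the press",
    "ministry details the vaccination schedule",
    "shocking secret the elites are hiding from you",
    "miracle cure doctors refuse to reveal",
    "celebrity spotted with alien proof inside",
    "one weird trick erases all your debt instantly",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(stop_words="english"), SVC(kernel="linear"))

# Cross-validated accuracy estimates generalization before deployment.
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")

# Fit on everything, then read the signed margin as the confidence score
# (positive = real, negative = fake).
model.fit(texts, labels)
margin = model.decision_function(["officials confirm the audited figures"])
```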
C. Optimization:
SVM parameters like cost, kernel type, gamma, and epsilon were fine-tuned to maximize accuracy.
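The fine-tuning step can be expressed as a grid search over the named parameters. This sketch assumes scikit-learn's GridSearchCV and a small synthetic feature matrix in place of the real TF-IDF features; note that epsilon belongs to the SVR (regression) formulation used for continuous scores, so it is omitted from this classifier grid.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic feature matrix standing in for TF-IDF + metadata features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(int)

# Exhaustively try combinations of cost (C), kernel type, and gamma,
# keeping the combination with the best cross-validated accuracy.
grid = GridSearchCV(
    SVC(),
    {"C": [1, 100, 300], "kernel": ["linear", "rbf"], "gamma": [0.001, "scale"]},
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_
```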
D. Deployment:
Once optimized, the model is used to classify new articles and provide a confidence score for reliability.
Experiments & Results:
Dataset Used:
Combined two datasets:
Fake news from 244 flagged websites (12,999 entries)
Real news from major outlets (e.g., CNN, NYT, Reuters)
Features included top words, N-grams, date, sentiment, source, author, and label.
Findings:
Bag-of-Words and 2-word N-Grams were most effective.
Sentiment score had limited impact.
Source, author, and date greatly improved model accuracy.
Best results came from encoding the author's name, achieving 100% accuracy.
Final Model Parameters:
Cost (C): 300
Epsilon (ε): 0.0001
Gamma (γ): 0.001
Linear and polynomial SVM kernels performed best.
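The presence of an epsilon parameter suggests an epsilon-SVR formulation whose continuous output is thresholded at zero into real/fake, as described earlier. A sketch with the reported parameter values, using scikit-learn's SVR on synthetic stand-in features (both assumptions, not the paper's code):

```python
import numpy as np
from sklearn.svm import SVR

# Reported parameters: C = 300, epsilon = 0.0001, gamma = 0.001.
model = SVR(kernel="rbf", C=300, gamma=0.001, epsilon=0.0001)

# Synthetic features; targets +1 = real, -1 = fake in this toy setup.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = np.sign(X[:, 0])
model.fit(X, y)

scores = model.predict(X)                         # continuous confidence scores
predictions = np.where(scores >= 0, "real", "fake")
```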
Conclusion
Our study confirms that a Support Vector Machine (SVM) classifier is highly effective in identifying fake news. Key takeaways include:
1) The most crucial features for detection are text, author, source, and date.
2) N-Gram models outperform Bag-of-Words when analyzing larger datasets.
3) SVM provides superior accuracy while also assigning confidence scores to its classifications.
4) Future enhancements could involve expanding the dataset and implementing real-time updates for continuous learning.