The rapid proliferation of social media platforms and online news portals has made fake news easy to spread, compromising public trust, political stability, and societal well-being. The misleading textual content of fake news is often accompanied by manipulated or unrelated images, which makes detection even harder. Standard text-based detection mechanisms cannot capture the interplay between language and visuals, so more sophisticated multimodal approaches are required. This paper proposes a multimodal fake news detection framework that uses the Gemini API for both textual and visual analysis. The Gemini API generates contextual embeddings and semantic representations of the news text while extracting image features through its image understanding capability. The framework combines the textual and visual information to identify subtle discrepancies between text and imagery, which are often indicative of fake news.
Experimental evaluation shows that the multimodal system outperforms unimodal approaches on accuracy, precision, recall, and F1-score. Given Gemini's rich language and image-processing capabilities, the model detects complex manipulations, such as misleading captions, doctored images, and semantic inconsistencies, that often go undetected by classic detection methods. The system is deployed as a web application for real-time news verification, where users can submit news text with associated images and immediately receive an assessment of the news item's authenticity. This design provides the scalability, efficiency, and practical applicability demanded by the urgent need for automated, real-time fake news detection in digital media.
The results demonstrate the potential of large multimodal models accessed via APIs to fight misinformation. Explainable predictions, multilingual capabilities, and integration with social-context features could further improve the framework's performance, transparency, and applicability. The proposed system offers the reliability, scalability, and usability needed to counter fake news dissemination effectively.
Introduction
The rapid growth of social media and online news platforms has made information more accessible but also accelerated the spread of misinformation, which can distort public opinion, influence elections, and cause social unrest. Traditional detection methods based on text analysis (e.g., keywords, sentiment, linguistic patterns) are insufficient for modern misinformation, which often combines text and images to appear credible.
To address this, the study proposes a multimodal misinformation detection system that analyzes both text and images. It leverages advanced deep learning techniques and uses the Gemini API to extract semantic text embeddings and visual features. These are fused to detect inconsistencies between text and images, an important indicator of fake news. The system is implemented as a real-time web application, enabling users to verify news efficiently and at scale.
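The cross-modal consistency idea above can be sketched as a cosine-similarity check between the two embeddings. This is an illustrative sketch, not the paper's implementation: the vectors below are placeholders standing in for Gemini text and image embeddings, and the function name `consistency_score` is an assumption.

```python
import numpy as np

def consistency_score(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Cosine similarity between a text embedding and an image embedding.

    A low score suggests the caption and the image describe different
    things -- one signal of potentially fake news.
    """
    num = float(np.dot(text_emb, image_emb))
    denom = float(np.linalg.norm(text_emb) * np.linalg.norm(image_emb)) or 1.0
    return num / denom

# Placeholder embeddings standing in for Gemini API outputs.
text_vec        = np.array([0.9, 0.1, 0.2])
matching_img    = np.array([0.8, 0.2, 0.1])
mismatching_img = np.array([-0.7, 0.6, -0.1])

# A matching text/image pair should score higher than a mismatched one.
print(consistency_score(text_vec, matching_img) >
      consistency_score(text_vec, mismatching_img))  # prints: True
```

In a deployed system the two embeddings would come from the Gemini API's text-embedding and image-understanding endpoints; only the comparison logic is shown here.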
The literature shows a progression from classical machine learning models (like SVM and Naïve Bayes) to deep learning (CNNs, RNNs), and then to transformer-based models (e.g., BERT), which improved contextual understanding. Recent research emphasizes multimodal approaches, as they outperform unimodal systems by capturing cross-modal inconsistencies.
The proposed framework includes preprocessing, feature extraction (text and image), multimodal fusion, and classification. It improves accuracy, scalability, and real-time usability, making it suitable for practical deployment.
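The four stages named above can be laid out as a minimal pipeline. Every function body here is a stand-in: a real system would obtain the embeddings from the Gemini API, and the hash-seeded stub vectors and the 0.5 decision threshold are illustrative assumptions, not the paper's parameters.

```python
import re
import numpy as np

def preprocess(text: str) -> str:
    """Stage 1: normalise whitespace and lowercase the news text."""
    return re.sub(r"\s+", " ", text).strip().lower()

def extract_text_features(text: str, dim: int = 8) -> np.ndarray:
    """Stage 2a: stub text embedding (a real system would query Gemini)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def extract_image_features(image_bytes: bytes, dim: int = 8) -> np.ndarray:
    """Stage 2b: stub image embedding (a real system would query Gemini)."""
    rng = np.random.default_rng(abs(hash(image_bytes)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def fuse(text_vec: np.ndarray, image_vec: np.ndarray) -> np.ndarray:
    """Stage 3: early fusion by concatenating the two modality vectors."""
    return np.concatenate([text_vec, image_vec])

def classify(fused: np.ndarray) -> str:
    """Stage 4: toy classifier flagging low text/image agreement as fake."""
    half = fused.shape[0] // 2
    agreement = float(np.dot(fused[:half], fused[half:]))
    return "real" if agreement >= 0.5 else "fake"

label = classify(fuse(extract_text_features(preprocess("Breaking news ...")),
                      extract_image_features(b"raw image bytes")))
assert label in {"real", "fake"}
```

The sketch keeps the interfaces between stages explicit, so swapping the stub extractors for genuine API calls, or the threshold rule for a trained classifier, changes only one function at a time.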
However, challenges remain, including handling subtle linguistic ambiguity, detecting sophisticated image manipulations, and effectively fusing multimodal features without bias toward one modality.
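One standard way to address the last challenge, fusion without bias toward one modality, is a gated combination whose weights are normalised to sum to one. The sketch below uses a softmax gate; the gate logits would be learned in a real model, and the name `gated_fusion` is an assumption rather than the paper's method.

```python
import numpy as np

def gated_fusion(text_vec: np.ndarray,
                 image_vec: np.ndarray,
                 gate_logits: np.ndarray) -> np.ndarray:
    """Combine modalities with softmax-normalised weights.

    Because the weights sum to 1, neither modality can dominate the
    fused representation unboundedly; gate_logits would be learned.
    """
    w = np.exp(gate_logits - np.max(gate_logits))  # stable softmax
    w = w / w.sum()
    return w[0] * text_vec + w[1] * image_vec

# Equal logits give an even blend of the two modalities.
fused = gated_fusion(np.ones(3), np.zeros(3), np.array([0.0, 0.0]))
print(fused)  # prints: [0.5 0.5 0.5]
```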
Conclusion
This paper presents a multimodal framework for fake news detection that uses the textual and visual analyses provided by the Gemini API. By fusing advanced text embeddings with image feature extraction, the framework identifies subtle inconsistencies and manipulations characteristic of deceptive news. Combining modalities increases accuracy, robustness, and practical applicability compared with unimodal approaches, addressing principal limitations of traditional text-only or image-only methods. Empirical evaluations show strong performance on benchmark datasets across accuracy, precision, recall, and F1-score. The Gemini API makes embedding generation scalable and efficient, unlike many deep learning-based multimodal frameworks that require substantial computational resources. Real-time deployment as a web application enhances practical usability, allowing users to verify news content rapidly and reliably. The multimodal approach also highlights the importance of cross-modal analysis in countering misinformation: detecting semantic discrepancies between text and images can expose sophisticated fake news that evades traditional detection techniques. This capability is particularly relevant in today's digital media environment, where much deceptive content is designed to appear credible.
Despite this effectiveness, challenges remain regarding semantic complexity, image ambiguity, optimization of fusion strategies, and explainability. Incorporating explainable AI, multilingual support, and social-context features, together with improving computational efficiency, are promising directions for future research.
In summary, the proposed system shows that large multimodal models accessed through APIs are a powerful yet practical tool for limiting the spread of fake news. By combining state-of-the-art semantic and visual understanding with scalable deployment, the framework is dependable, user-friendly, and adaptable, and can therefore contribute meaningfully to curbing digital misinformation.