Abstract
Deepfake media has rapidly emerged as one of the most concerning consequences of recent advances in artificial intelligence and generative modelling. With tools such as Generative Adversarial Networks (GANs), facial reenactment models, and audio-cloning systems becoming publicly accessible, even non-experts can now fabricate highly realistic audio and video content. Detecting deepfake content has therefore become an essential requirement for protecting online calling platforms. While most research focuses on identifying manipulated content in offline settings, very few solutions operate during real-time communication such as video conferencing and voice calls. This paper presents a hybrid deepfake detection system that analyzes both video and audio signals during live calls. A small, fast convolutional neural network checks the video stream for inconsistencies, while a spectrogram-based classifier examines sound patterns for anomalies. The system is designed to respond immediately, so that users can be alerted during a live call. Experiments were carried out on public datasets from Kaggle and GitHub as well as custom-generated deepfake clips to verify real-time performance. The results demonstrate reasonably high detection accuracy and low latency even on midrange hardware, making the model suitable for deployment on mobile devices or for integration into existing communication software. This paper contributes a student-developed, resource-efficient, real-time approach to safeguarding digital interactions.
Introduction
This paper presents a real-time multimodal deepfake detection framework designed to identify manipulated audio and video during live communication, addressing a major limitation of existing deepfake detection systems, which operate only on prerecorded media. With rapid advances in generative AI, including GANs, transformers, and diffusion models, deepfakes have become easier to create and harder to detect, leading to serious security threats such as impersonation, fraud, misinformation, and social engineering, especially during live calls on platforms like Zoom, Google Meet, WhatsApp, and conventional phone calls.
To tackle this challenge, the proposed system performs simultaneous audio and video analysis in real time. It integrates transformer-based pretrained models from Hugging Face: VideoMAE for detecting facial and visual manipulations, and Wav2Vec2 for identifying synthetic or cloned speech. These models are optimized using ONNX and TensorFlow Lite to achieve low-latency inference suitable for live streaming. A FastAPI backend supports scalable processing, while a microservice-style architecture enables easy integration across platforms.
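To make the dual-model setup concrete, the following minimal sketch shows how such a pipeline could be assembled from Hugging Face transformers. The checkpoint names, the two-label classification heads, and the `score_clip` helper are illustrative assumptions, not the exact configuration used in this work.

```python
# Sketch of the dual-model inference pipeline: VideoMAE scores a short clip of
# sampled frames, Wav2Vec2 scores the corresponding 16 kHz waveform chunk.
import torch
from transformers import (
    VideoMAEImageProcessor, VideoMAEForVideoClassification,
    Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification,
)

# Video branch (checkpoint name is an illustrative placeholder).
video_processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
video_model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base", num_labels=2  # assumed real-vs-fake head
)

# Audio branch (likewise a placeholder checkpoint).
audio_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
audio_model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2
)

@torch.no_grad()
def score_clip(frames, waveform):
    """Return (video_fake_prob, audio_fake_prob) for one short segment."""
    v_in = video_processor(list(frames), return_tensors="pt")
    a_in = audio_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    v_prob = video_model(**v_in).logits.softmax(-1)[0, 1].item()
    a_prob = audio_model(**a_in).logits.softmax(-1)[0, 1].item()
    return v_prob, a_prob
```

In deployment, these models would be exported to ONNX or TensorFlow Lite as described above; the PyTorch form is shown here only for readability.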
The methodology includes live data acquisition via WebRTC stream interception, device-side preprocessing, a dual-model inference pipeline, decision fusion, and instant alert generation. The system is deployed through browser extensions and Android services, requiring no modification to the communication platforms themselves. Privacy is emphasized by performing most preprocessing and early inference directly on the user’s device, minimizing data transfer.
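A hedged sketch of the server side of this pipeline is shown below, assuming a FastAPI WebSocket endpoint that receives preprocessed segments, applies the dual-model scorer (the `score_clip` helper from the previous sketch), performs simple equal-weight late fusion, and pushes an alert flag back to the client. The endpoint path, message format, and fusion weights are assumptions for illustration.

```python
# Sketch of decision fusion and instant alert generation over a WebSocket.
from fastapi import FastAPI, WebSocket

app = FastAPI()
ALERT_THRESHOLD = 0.7  # assumed cutoff on the fused fake-probability score

@app.websocket("/detect")
async def detect(ws: WebSocket):
    await ws.accept()
    while True:
        # Client sends one preprocessed segment per message (format assumed).
        segment = await ws.receive_json()
        v_prob, a_prob = score_clip(segment["video"], segment["audio"])
        fused = 0.5 * v_prob + 0.5 * a_prob  # equal-weight late fusion (assumed)
        await ws.send_json({
            "fused_score": fused,
            "alert": fused >= ALERT_THRESHOLD,  # drives the instant user alert
        })
```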
Experimental results show strong performance, achieving around 93–94% accuracy for video detection and 94–95% accuracy for audio detection, with end-to-end latency generally under 520 ms, making it practical for real-time use. Testing across real platforms demonstrated stable performance under typical conditions, though accuracy declined slightly with extreme noise, low resolution, or heavy motion blur. User studies confirmed that real-time alerts were effective without disrupting conversations.
Overall, the proposed framework significantly advances deepfake defense by enabling low-latency, real-time detection during live interactions, outperforming traditional offline tools. By combining multimodal analysis, transformer-based models, and cross-platform deployment, the system offers a practical and timely solution to modern deepfake threats in real-world communication environments.
Conclusion
The real-time deepfake detection system for voice and video calls was developed to address the growing security risks associated with synthetic media. As AI-generated audio and video become more lifelike, the chances of users being targeted through impersonation, misinformation, or online fraud have increased sharply. This project demonstrates that it is possible to detect manipulated content during a live conversation, instead of waiting until after the call has ended.
The system uses a hybrid architecture built around deep learning models, a FastAPI-based backend, and accessible interfaces for both desktop and mobile users. This design allows audio and video to be processed in parallel; both models were fine-tuned to achieve accuracy levels above 90% while still maintaining response times short enough to avoid interrupting call flow.
The audio component relies on a CNN-LSTM structure to capture temporal speech characteristics, whereas the video module uses 3D-CNN-based features to identify frame-level inconsistencies typical of deepfake generation. A WebSocket-driven backend ensures continuous data exchange between the client and inference server, providing immediate feedback whenever suspicious activity is detected.
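As a rough illustration of the audio branch, the following PyTorch sketch implements a CNN-LSTM of the kind described: a small CNN summarizes each spectrogram column, and an LSTM models the temporal speech dynamics. All layer sizes and the mel-spectrogram input shape are illustrative assumptions, not the exact architecture used in this work.

```python
# CNN-LSTM audio classifier sketch: CNN over log-mel spectrograms, LSTM over time.
import torch
import torch.nn as nn

class AudioCnnLstm(nn.Module):
    def __init__(self, n_mels: int = 64, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # halve the frequency axis, keep time intact
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(32 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # real-vs-fake logits

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram
        x = self.cnn(spec)                     # (B, 32, n_mels/4, T)
        x = x.permute(0, 3, 1, 2).flatten(2)   # (B, T, 32 * n_mels/4)
        _, (h, _) = self.lstm(x)               # h: (1, B, hidden)
        return self.head(h[-1])                # (B, 2) logits

# Example: one clip of ~100 spectrogram frames.
logits = AudioCnnLstm()(torch.randn(1, 1, 64, 100))
```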
References
[1] Afchar, D., Nozick, V., Yamagishi, J., & Echizen, I. (2018). MesoNet: A Compact Facial Video Forgery Detection Network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (pp. 1–7). IEEE. https://doi.org/10.1109/WIFS.2018.8630761
[2] Korshunov, P., & Marcel, S. (2018). Deepfakes: A New Threat to Face Recognition? Assessment and Detection. arXiv:1812.08685. https://arxiv.org/abs/1812.08685
[3] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., & Nießner, M. (2019). FaceForensics++: Learning to Detect Manipulated Facial Images. In IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 1–11). Dataset: https://github.com/ondyari/FaceForensics
[4] Li, Y., & Lyu, S. (2018). Exposing DeepFake Videos by Detecting Face Warping Artifacts. arXiv:1811.00656. https://arxiv.org/abs/1811.00656
[5] Zhou, P., Han, X., Morariu, V. I., & Davis, L. S. (2018). Two-Stream Neural Networks for Tampered Face Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1831–1839).
[6] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., & Ferrer, C. C. (2020). The DeepFake Detection Challenge (DFDC) Dataset. arXiv:2006.07397. https://arxiv.org/abs/2006.07397
[7] Khan, S., Rahmani, H., Shah, S. A. A., & Bennamoun, M. (2018). A Guide to Convolutional Neural Networks for Computer Vision. Synthesis Lectures on Computer Vision, 8(1), 1–207. Morgan & Claypool Publishers. https://doi.org/10.2200/S00822ED1V01Y201712COV015
[8] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition (ResNet). In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). https://arxiv.org/abs/1512.03385
[9] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS) (pp. 5998–6008). https://arxiv.org/abs/1706.03762
[10] PyTorch. (2024). PyTorch: An Open Source Deep Learning Platform. https://pytorch.org
[11] TensorFlow. (2024). TensorFlow: Machine Learning for Everyone. https://www.tensorflow.org
[12] Hugging Face. (2024). Transformers: State-of-the-Art Machine Learning Models. https://huggingface.co/docs/transformers
[13] Google. (2024). WebRTC: Real-Time Communication Components. https://webrtc.org
[14] MongoDB Inc. (2024). MongoDB Documentation. https://www.mongodb.com/docs
[15] FastAPI. (2024). FastAPI: Modern, Fast Web Framework for Building APIs with Python. https://fastapi.tiangolo.com
[16] Karras, T., Laine, S., & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN). In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4401–4410). https://arxiv.org/abs/1812.04948