The rapid evolution of deepfake technology has intensified the challenge of ensuring media authenticity, driving the need for detection methods that go beyond conventional techniques in both adaptability and effectiveness. This paper introduces a deepfake detection system that integrates behavioral and visual analysis without requiring a custom training dataset. By harnessing pre-trained models from trusted repositories, the system employs a triple-detection pipeline (MTCNN, DLib, and MediaPipe Face Mesh) to achieve reliable face and landmark identification across a wide range of video inputs, remaining resilient even on challenging or low-quality footage. At its core, the framework analyzes a rich set of features to distinguish authentic from synthetic content: eye-blinking dynamics such as frequency, period, duration, and symmetry; lip-movement consistency; and temporal frame coherence assessed through optical flow. These behavioral cues are complemented by EfficientNet-B7, a state-of-the-art model that identifies pixel-level anomalies often present in deepfake videos. Implemented on Google Colab, the system processes user-uploaded videos in real time, applying an optimized ensemble method with confidence-weighted scoring to classify content as "Real" or "Fake," offering a practical and accessible solution for media verification.
Extensive debugging and adaptive threshold tuning bolster the system's reliability against modern deepfakes, addressing the shortcomings of single-feature approaches such as the original DeepVision framework. By combining multiple detection modalities and leveraging cloud-based computation, this lightweight, scalable tool overcomes the limitations of traditional single-cue detectors and provides a robust defense against synthetic media. This work represents a significant step forward in deepfake detection, adaptable to the evolving landscape of digital content manipulation and suitable for real-world applications.
Introduction
Deepfakes are hyper-realistic synthetic media generated using deep learning, particularly GANs (Generative Adversarial Networks). They replicate facial expressions, voices, and gestures with high accuracy, challenging the trustworthiness of digital media. The misuse of deepfakes poses serious threats in politics, media, security, and social platforms, potentially spreading misinformation, distorting public perception, and eroding trust in democratic processes.
Challenges
Traditional detection methods focused on low-level inconsistencies (e.g., pixel artifacts, unnatural lighting), but these are becoming less effective as deepfake generation techniques improve. The diversity and sophistication of manipulations—including face swaps, expression edits, and voice synthesis—demand more advanced, multi-modal and robust detection systems.
Related Work
Early Approaches: Relied on pixel-level analysis and visual inconsistencies (e.g., convolutional traces, misalignment).
Deep Learning Models: CNNs and transfer learning have improved performance in identifying subtle artifacts.
Behavioral Cues: Eye-blinking and facial landmark tracking help detect physiological inconsistencies.
Ensemble & Hybrid Models: Systems like MMGANGuard and transformer-based approaches integrate multiple cues for greater generalizability and accuracy.
Proposed Methodology
The paper introduces a hybrid deepfake detection framework combining three core analysis modules with a fusion and decision layer:
A. Behavioral Analysis
Focuses on eye-blinking patterns—blink rate, duration, and symmetry—using MediaPipe Face Mesh.
Exploits the difficulty of accurately synthesizing involuntary behaviors.
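A minimal sketch of how such blink features can be computed with MediaPipe Face Mesh is given below. The specific landmark indices, the eye-aspect-ratio (EAR) threshold, and the use of left/right EAR correlation as a symmetry proxy are illustrative assumptions, not values taken from the paper.

```python
# Sketch: blink-feature extraction with MediaPipe Face Mesh (illustrative values only).
import cv2
import numpy as np
import mediapipe as mp

# Six landmarks per eye (corner, two upper-lid, corner, two lower-lid); common choices, not from the paper.
LEFT_EYE = [33, 160, 158, 133, 153, 144]
RIGHT_EYE = [362, 385, 387, 263, 373, 380]
EAR_THRESHOLD = 0.21  # heuristic: eye treated as closed below this value

def eye_aspect_ratio(pts):
    """EAR = (|p2-p6| + |p3-p5|) / (2*|p1-p4|); small values indicate a closed eye."""
    p1, p2, p3, p4, p5, p6 = pts
    return (np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)) / (2.0 * np.linalg.norm(p1 - p4) + 1e-6)

def blink_features(video_path):
    """Return blink rate (blinks/sec), mean blink duration (frames), and a left/right symmetry proxy."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    ears_l, ears_r, closed = [], [], []
    with mp.solutions.face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True) as mesh:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            res = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not res.multi_face_landmarks:
                continue
            h, w = frame.shape[:2]
            lm = res.multi_face_landmarks[0].landmark
            pts = lambda idx: np.array([[lm[i].x * w, lm[i].y * h] for i in idx])
            ear_l, ear_r = eye_aspect_ratio(pts(LEFT_EYE)), eye_aspect_ratio(pts(RIGHT_EYE))
            ears_l.append(ear_l)
            ears_r.append(ear_r)
            closed.append((ear_l + ear_r) / 2.0 < EAR_THRESHOLD)
    cap.release()
    # A blink is a run of consecutive "closed" frames.
    blinks, durations, run = 0, [], 0
    for c in closed + [False]:
        if c:
            run += 1
        elif run:
            blinks += 1
            durations.append(run)
            run = 0
    n_frames = max(len(closed), 1)
    return {
        "blink_rate": blinks / (n_frames / fps),
        "mean_blink_duration": float(np.mean(durations)) if durations else 0.0,
        "eye_symmetry": float(np.corrcoef(ears_l, ears_r)[0, 1]) if len(ears_l) > 1 else 0.0,
    }
```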
B. Visual Feature Extraction
Uses EfficientNet-B7 (fine-tuned on deepfake datasets) to extract texture, shading, and boundary inconsistencies.
Optional use of XGBoost or MLP classifiers for enhanced prediction from feature maps.
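The sketch below illustrates per-frame feature extraction with a pre-trained EfficientNet-B7 backbone. It uses torchvision ImageNet weights as a stand-in for the fine-tuned weights described above, and the optional XGBoost head is only indicated; preprocessing details are assumptions.

```python
# Sketch: EfficientNet-B7 as a per-frame feature extractor (ImageNet weights as a placeholder).
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b7, EfficientNet_B7_Weights

weights = EfficientNet_B7_Weights.DEFAULT
preprocess = weights.transforms()        # resize/normalize as the backbone expects
backbone = efficientnet_b7(weights=weights)
backbone.classifier = nn.Identity()      # keep the 2560-d pooled feature vector per frame
backbone.eval()

@torch.no_grad()
def frame_features(frames):
    """frames: list of HxWx3 uint8 RGB numpy arrays -> (N, 2560) feature matrix."""
    batch = torch.stack([preprocess(torch.from_numpy(f).permute(2, 0, 1)) for f in frames])
    return backbone(batch)

# Optional downstream classifier on the pooled features (hypothetical training data):
# import xgboost as xgb
# clf = xgb.XGBClassifier(n_estimators=300, max_depth=6)
# clf.fit(train_features.numpy(), train_labels)
```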
C. Temporal Consistency Analysis
Uses optical flow (Farneback or PWC-Net) to detect unnatural motion between frames.
Identifies temporal discontinuities introduced during frame-by-frame generation.
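A minimal sketch of the Farneback-based temporal cue follows; the summary statistics (mean flow magnitude and a frame-to-frame "jerk" measure) and the parameter values are illustrative choices rather than the paper's exact configuration.

```python
# Sketch: temporal-consistency features from dense Farneback optical flow (OpenCV).
import cv2
import numpy as np

def temporal_inconsistency(video_path, max_frames=300):
    """Erratic changes in average motion between consecutive frames hint at frame-by-frame synthesis."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, magnitudes = None, []
    while len(magnitudes) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None,
                pyr_scale=0.5, levels=3, winsize=15,
                iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
            mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            magnitudes.append(float(mag.mean()))
        prev_gray = gray
    cap.release()
    magnitudes = np.array(magnitudes) if magnitudes else np.zeros(1)
    jerk = np.abs(np.diff(magnitudes)).mean() if len(magnitudes) > 1 else 0.0
    return {"mean_flow": float(magnitudes.mean()), "flow_jerk": float(jerk)}
```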
D. Fusion and Decision Layer
Combines outputs from all modules using score normalization, weighted fusion, and a threshold-based classifier.
Optional use of explainable AI (e.g., Grad-CAM) for model interpretability.
Output: a deepfake probability score and a binary Real/Fake classification.
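A minimal sketch of confidence-weighted fusion with a threshold decision is given below; the per-module weights and the 0.5 cut-off are placeholders for the tuned values the paper describes.

```python
# Sketch: score normalization, confidence-weighted fusion, and threshold decision.
import numpy as np

def fuse_scores(scores, weights=None, threshold=0.5):
    """scores: dict of per-module fake probabilities in [0, 1],
    e.g. {"behavioral": 0.3, "visual": 0.8, "temporal": 0.7}."""
    names = sorted(scores)
    s = np.clip(np.array([scores[n] for n in names], dtype=float), 0.0, 1.0)  # keep scores in range
    w = np.array([weights.get(n, 1.0) for n in names]) if weights else np.ones_like(s)
    w = w / w.sum()                      # confidence weights normalized to sum to one
    fake_prob = float(np.dot(w, s))      # weighted fusion of module scores
    return {"fake_probability": fake_prob,
            "label": "Fake" if fake_prob >= threshold else "Real"}

# Example (hypothetical module outputs, visual branch weighted more heavily):
# fuse_scores({"behavioral": 0.35, "visual": 0.82, "temporal": 0.71},
#             weights={"behavioral": 1.0, "visual": 2.0, "temporal": 1.0})
```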
Results & Evaluation
Dataset: Benchmarked on labeled datasets like FaceForensics++ and Celeb-DF.
Performance:
Accuracy: 91.2%
Precision: 89.5%
Recall: 92.8%
F1-score: 91.1%
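As a consistency check on the reported figures, the F1-score follows from precision and recall as F1 = 2PR / (P + R) = (2 × 0.895 × 0.928) / (0.895 + 0.928) ≈ 0.911, matching the reported 91.1%.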
Module Contributions:
Behavioral analysis excelled at detecting poorly modeled physiological features.
EfficientNet-B7 was sensitive to spatial inconsistencies.
Temporal analysis identified unrealistic motion, especially in facial areas.
Robustness:
Effectively detected:
Face swaps
Expression edits
Lip-sync and voice synthesis
Maintained low false-positive rates, ensuring reliability in real-world applications (e.g., media forensics, journalism).
Conclusion
The proposed hybrid deepfake detection framework represents a significant step forward in the ongoing battle against synthetic media manipulation. By intelligently integrating behavioral analysis, deep visual feature extraction, and temporal consistency checks, the system addresses the shortcomings of unimodal approaches and delivers a comprehensive, high-performing detection solution.
The experimental results affirm that this multi-modal architecture not only achieves high accuracy (91.2%) but also exhibits strong resilience across a variety of deepfake styles and generation methods. The framework's ability to capture both low-level pixel anomalies and high-level human behavioral inconsistencies provides a powerful defense against evolving deepfake technologies.
This study underscores the importance of using biologically inspired and temporally aware strategies in combination with modern deep learning techniques to strengthen media verification tools. The modular design of the system also enables future extensions and fine-tuning, enhancing its applicability in diverse operational environments.
Future Scope
While the current system shows promising results, several directions are proposed for future enhancement:
1) Real-time performance optimization: Reducing inference time and computational load to enable deployment on mobile or edge devices.
2) Audio-visual fusion: Incorporating voiceprint analysis and speech-lip sync verification to detect multimodal deepfakes more effectively.
3) Domain adaptation: Improving generalization by training on a wider array of synthetic media generated by the latest GAN and transformer-based models.
4) Explainability: Introducing interpretable AI techniques (e.g., saliency maps, Grad-CAM) to provide visual insights into the system's predictions, fostering trust and transparency in forensic settings.
In conclusion, the proposed framework lays a solid foundation for building robust, extensible, and real-world-ready deepfake detection systems—a crucial step toward safeguarding the integrity of digital information in the age of synthetic media.
References
[1] Md Shohel Rana, Mohammad Nur Nobi, Beddhu Murali, and Andrew H. Sung, "Deepfake Detection: A Systematic Literature Review."
[2] Momina Masood, Mariam Nawaz, Khalid Mahmood Malik, Ali Javed, and Aun Irtaza, "Deepfakes Generation and Detection: State-of-the-Art, Open Challenges, Countermeasures, and Way Forward."
[3] Mohammad A. Hoque, Sasu Tarkoma, Md Sadek Ferdous, and Mohsin Khan, "Real, Forged or Deep Fake? Enabling the Ground Truth on the Internet."
[4] Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M. A., Al-Amidie, M., and Farhan, L., "Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions."
[5] Sarker, I. H., "Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions."
[6] Momotaz Begum, Mehedi Hasan Shuvo, Mostofa Kamal Nasir, Amran Hossain, Mohammad Jakir Hossain, Imran Ashraf, Jia Uddin, and Md. Abdus Samad, "LCNN: Lightweight CNN Architecture for Software Defect Feature Identification Using Explainable AI."
[7] Khan, M. Z., Gajendran, M. K., Lee, Y., and Khan, M. A., "Deep Neural Architectures for Medical Image Semantic Segmentation."
[8] Adwa Alrowais, Meshari H. Alanazi, Asma Abbas Hassan, Wafa Sulaiman Almukadi, Radwa Marzouk, and Ahmed Mahmud, "Boosting Deep Feature Fusion-Based Detection Model for Fake Faces Generated by Generative Adversarial Networks for Consumer Space Environment."
[9] Abdulqader M. Almars, "Deepfakes Detection Techniques Using Deep Learning: A Survey."
[10] Yogesh Patel, Rajesh Gupta, Sudeep Tanwar, Pronaya Bhattacharya, Innocent Ewean Davidson, Royi Nyameko, Srinivas Aluvala, and Vrince Vimal, "Deepfake Generation and Detection: Case Study and Challenges."
[11] Tackhyun Jung, Sangwon Kim, and Keecheon Kim, "DeepVision: Deepfakes Detection Using Human Eye Blinking Pattern."
[12] Syed Abdul Rahman, Syed Abu Bakar, and Bilal Ashfaq Ahmed, "DeepFake on Face and Expression Swap: A Review."
[13] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato, "Fighting Deepfake by Exposing the Convolutional Traces on Images."
[14] Kurniawan Nur Ramadhani, Rinaldi Munir, and Nugraha Priya Utama, "Improving Video Vision Transformer for Deepfake Video Detection Using Facial Landmark, Depthwise Separable Convolution, and Self Attention."
[15] Jihyeon Kang, Sang-Keun Ji, Sangyeong Lee, Daehee Jang, and Jong-Uk Hou, "Detection Enhancement for Various Deepfake Types Based on Residual Noise and Manipulation Traces."
[16] Aya Ismail, Marwa Elpeltagy, Mervat S. Zaki, and Kamal Eldahshan, "A New Deep Learning-Based Methodology for Video Deepfake Detection Using XGBoost."
[17] Syed Ali Raza, Usman Habib, Muhammad Usman, Adeel Ashraf Cheema, and Muhammad Sajid Khan, "MMGANGuard: A Robust Approach for Detecting Fake Images Generated by GANs Using Multi-Model Techniques."
[18] Ashgan H. Khalil, Atef Z. Ghalwash, Hala Abdel-Galil Elsayed, Gouda I. Salama, and Haitham A. Ghalwash, "Enhancing Digital Image Forgery Detection Using Transfer Learning."
[19] Samuel Henrique Silva, Mazal Bethany, Alexis Megan Votto, Ian Henry Scarff, Nicole Beebe, and Peyman Najafirad, "Deepfake Forensics Analysis: An Explainable Hierarchical Ensemble of Weakly Supervised Models."
[20] Ching-Chun Chang, Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen, "Cyber Vaccine for Deepfake Immunity."
[21] Eunji Kim and Sungzoon Cho, "Exposing Fake Faces Through Deep Neural Networks Combining Content and Trace Feature Extractors."
[22] Vivek Mahajan, Vishal Waghmare, Ashwin Wani, and Sushant Jogdand, "A Survey on Deep Learning Based Deep Fake Detection."
[23] Rami Mubarak, Tariq Alsboui, Saad Khan, Omar Alshaikh, Isa Inuwa-Dutse, and Simon Parkinson, "A Survey on the Detection and Impacts of Deepfakes in Visual, Audio, and Textual Formats."
[24] Tackhyun Jung, Sangwon Kim, and Keecheon Kim, "DeepFake Detection for Human Face Images and Videos: A Survey."