Deepfake Detection in Images and Videos Using LSTM and ResNeXt CNN
Authors: Mr. R. Vamsidhar Raju, Mr. S. Janakiram, Mr. P. Reddy Prasad, Mr. B. Lohith, Mr. N. Vijaya Kumar, Dr. R. Karunia Krishnapriya, Mr. V Shaik Mohammad Shahil, Mr. Pandreti Praveen
The growing power of deep learning algorithms has made creating realistic, AI-generated videos and images, known as deepfakes, relatively easy. These can be used maliciously to incite political unrest or fabricate terrorism events. To combat this, researchers have developed a deep learning-based method to distinguish AI-generated fake videos from real ones. This method uses a combination of ResNeXt convolutional neural networks and Long Short-Term Memory (LSTM) based Recurrent Neural Networks (RNNs).
The ResNeXt convolutional neural network extracts frame-level features, which are then used to train the LSTM-based RNN. This RNN classifies whether a video is real or fake, detecting manipulations such as replacement and reenactment deepfakes. To ensure the model performs well in real-time scenarios, it is evaluated on a large, balanced dataset that combines various existing datasets such as the Deepfake Detection Challenge and Celeb-DF. This approach achieves competitive results using a simple yet robust method.
Introduction
1. Problem Statement
Deepfakes are AI-generated media that pose significant threats to national security, personal privacy, and public trust.
Existing detection methods often rely on outdated computer vision or basic ML techniques, which struggle against sophisticated deepfake algorithms.
2. Proposed Solution
A hybrid deep learning model combining:
ResNeXt CNN for spatial feature extraction from video frames.
LSTM networks for temporal sequence modeling to detect inconsistencies over time.
3. Objectives
Build and test a robust deepfake detection system.
Benchmark performance against state-of-the-art methods.
Ensure resilience across diverse deepfake types and attacks.
4. Methodology
Data Collection & Preprocessing:
Used datasets: FaceForensics++, DFDC, Celeb-DF.
Extracted video frames, detected and cropped the face region, and resized the crops to 224×224.
Applied normalization and data augmentation (see the preprocessing sketch below).
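A minimal sketch of this preprocessing step, assuming OpenCV's bundled Haar cascade as the face detector and ImageNet normalization statistics; the helper name extract_face_frames and the choice of 20 sampled frames per video are illustrative, not specified in the paper.

```python
# Preprocessing sketch: sample frames from a video, crop the face region,
# resize to 224x224, and normalize with ImageNet statistics.
import cv2
import torch
from torchvision import transforms

# Haar cascade face detector bundled with OpenCV (a simple stand-in; the
# paper does not name the detector it uses).
FACE_DETECTOR = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

normalize = transforms.Compose([
    transforms.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_face_frames(video_path, num_frames=20, size=224):
    """Return a (num_frames, 3, size, size) tensor of normalized face crops."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = FACE_DETECTOR.detectMultiScale(gray, 1.3, 5)
        if len(faces) == 0:
            continue  # skip frames with no detected face
        x, y, w, h = faces[0]
        face = cv2.resize(frame[y:y + h, x:x + w], (size, size))
        face = cv2.cvtColor(face, cv2.COLOR_BGR2RGB)
        frames.append(normalize(face))
    cap.release()
    return torch.stack(frames) if frames else None
```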
Model Design:
ResNeXt-50 is used to extract 2048-dimensional spatial features from each frame.
These features are fed to a 2-layer LSTM network (128 units each) to model temporal dependencies.
The output is passed through a fully connected layer with a sigmoid activation for binary classification (see the architecture sketch below).
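A PyTorch sketch of the architecture described above, using torchvision's resnext50_32x4d as the backbone; the class name DeepfakeDetector and the decision to classify from the last LSTM hidden state are assumptions where the paper does not pin down details.

```python
# Hybrid model sketch: ResNeXt-50 backbone producing 2048-d frame features,
# a 2-layer LSTM (128 hidden units) over the frame sequence, and a sigmoid
# classifier on the final hidden state.
import torch
import torch.nn as nn
from torchvision import models

class DeepfakeDetector(nn.Module):
    def __init__(self, hidden_size=128, num_layers=2):
        super().__init__()
        backbone = models.resnext50_32x4d(pretrained=True)
        # Drop the final classification layer to expose the 2048-d pooled features.
        self.feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(input_size=2048, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())

    def forward(self, x):
        # x: (batch, seq_len, 3, 224, 224)
        b, t, c, h, w = x.shape
        feats = self.feature_extractor(x.view(b * t, c, h, w))  # (b*t, 2048, 1, 1)
        feats = feats.view(b, t, -1)              # (b, t, 2048)
        out, _ = self.lstm(feats)                 # (b, t, hidden_size)
        return self.classifier(out[:, -1, :])     # probability that the video is fake
```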
Training Setup:
Optimizer: Adam (learning rate = 0.0001).
Loss: Binary Cross-Entropy.
Batch Size: 32.
Epochs: 20–50, with validation monitoring for overfitting (see the training-loop sketch below).
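A training-loop sketch under the reported settings (Adam at learning rate 0.0001, binary cross-entropy, batch size 32). It assumes the DeepfakeDetector sketch above and PyTorch DataLoaders train_loader and val_loader yielding (clips, labels) batches; the per-epoch validation accuracy print stands in for whatever monitoring the authors used.

```python
# Training sketch: Adam optimizer, BCE loss, with a simple validation pass
# per epoch to watch for overfitting.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DeepfakeDetector().to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(20):
    model.train()
    for clips, labels in train_loader:            # clips: (32, T, 3, 224, 224)
        clips, labels = clips.to(device), labels.float().to(device)
        optimizer.zero_grad()
        preds = model(clips).squeeze(1)
        loss = criterion(preds, labels)
        loss.backward()
        optimizer.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for clips, labels in val_loader:
            preds = model(clips.to(device)).squeeze(1).cpu()
            correct += ((preds > 0.5).long() == labels.long()).sum().item()
            total += labels.numel()
    print(f"epoch {epoch}: val accuracy = {correct / total:.3f}")
```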
Model Pipeline:
Input videos are preprocessed to extract only facial regions.
ResNeXt extracts frame-wise features.
LSTM processes frame sequences for motion and temporal patterns.
A final classifier predicts whether the video is real or fake (see the end-to-end inference sketch below).
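An end-to-end inference sketch tying together the hypothetical helpers from the earlier sketches; the 0.5 decision threshold and the file names in the usage comment are placeholders, not values given in the paper.

```python
# Inference sketch: face-crop frames from a video, run them through the
# ResNeXt+LSTM model, and threshold the predicted fake probability.
import torch

@torch.no_grad()
def predict_video(video_path, model, threshold=0.5):
    frames = extract_face_frames(video_path)       # (T, 3, 224, 224) or None
    if frames is None:
        return "no face detected"
    model.eval()
    prob_fake = model(frames.unsqueeze(0)).item()  # add a batch dimension
    return "FAKE" if prob_fake > threshold else "REAL"

# Example usage (paths are placeholders):
# model = DeepfakeDetector()
# model.load_state_dict(torch.load("detector.pt"))
# print(predict_video("suspect_clip.mp4", model))
```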
5. Results & Discussion
Performance Metrics:
Accuracy: 94.2%
Precision: 92.7%
Recall: 95.6%
ROC-AUC: 97.8%
The hybrid CNN-LSTM architecture significantly outperformed standalone models.
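For reference, metrics of this kind are typically computed from per-video predictions as in the short scikit-learn example below; the labels and probabilities shown are placeholders, and the paper's exact evaluation protocol is not reproduced here.

```python
# Illustrative metric computation from predicted probabilities (placeholder data).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = fake, 0 = real
y_prob = [0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.3]   # model probabilities
y_pred = [int(p > 0.5) for p in y_prob]             # thresholded predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))
```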
Strengths:
Detects subtle artefacts and unnatural motions.
Effective against various deepfake types.
Limitations:
Performance slightly drops on unseen datasets (cross-dataset testing).
Requires larger and more diverse training datasets or domain adaptation.
6. Literature Review Insights
Prior methods used CNNs to detect face artefacts or relied on eye blinking, but lacked temporal analysis.
RNN-based approaches had small datasets and limited real-time applicability.
The proposed model addresses these gaps by combining spatial and temporal features.
7. Future Improvements
Expand dataset diversity for better generalization.
Integrate domain adaptation techniques.
Optimize for real-time processing and robustness.
Conclusion
In this article, we presented a hybrid deep learning method that combines the spatial feature extraction strength of a ResNeXt CNN with the temporal sequence modeling capability of LSTM networks to detect deepfake images and videos. The proposed model effectively captures the unnatural motion patterns and visual imperfections characteristic of deepfake footage. Experimental results on benchmark datasets demonstrated strong accuracy and robustness, outperforming traditional single-stream models, which illustrates how effectively temporal and spatial information can be combined for deepfake detection. Although the model performs well, further work can focus on improving cross-dataset generalization and on lightweight architectures for real-time applications in digital forensics and social media monitoring.
References
[1] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics++: Learning to Detect Manipulated Facial Images," arXiv:1901.08971.
[2] Deepfake Detection Challenge dataset: https://www.kaggle.com/c/deepfake-detectionchallenge/data (retrieved March 26, 2020).
[3] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, "Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics," arXiv:1909.12962.
[4] "A deepfake video of Mark Zuckerberg goes viral on the eve of the House AI hearing": https://fortune.com/2019/06/12/deepfake-mark-zuckerberg/ (retrieved March 26, 2020).
[5] "Ten deepfake examples that amused and frightened people online": https://www.creativebloq.com/features (retrieved March 26, 2020).
[6] TensorFlow: https://www.tensorflow.org/ (retrieved March 26, 2020).
[7] Keras: https://keras.io/ (accessed March 26, 2020).
[8] PyTorch: https://pytorch.org/ (accessed March 26, 2020).
[9] G. Antipov, M. Baccouche, and J.-L. Dugelay, "Face Aging with Conditional Generative Adversarial Networks," arXiv:1702.01983, February 2017.
[10] J. Thies et al., "Face2Face: Real-time Face Capture and Reenactment of RGB Videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, June 2016, pp. 2387-2395.
[11] FaceApp: https://www.faceapp.com/ (retrieved March 26, 2020).
[12] Face Swap: https://faceswaponline.com/ (retrieved March 26, 2020).
[13] "Deepfakes, Revenge Porn, and the Impact on Women": https://www.forbes.com/sites/chenxiwang/2019/11/01/deepfakes-revenge-porn-and-the-impact-on-women/
[14] Y. Li and S. Lyu, "Exposing DeepFake Videos by Detecting Face Warping Artifacts," arXiv:1811.00656v3.
[15] Y. Li, M.-C. Chang, and S. Lyu, "Exposing AI Created Fake Videos by Detecting Eye Blinking," arXiv:1806.02877v2.
[16] H. H. Nguyen, J. Yamagishi, and I. Echizen, "Using Capsule Networks to Detect Forged Images and Videos," arXiv:1810.11215.
[17] D. Güera and E. J. Delp, "Deepfake Video Detection Using Recurrent Neural Networks," in 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 2018, pp. 1-6.
[18] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," arXiv:1412.6980, December 2014.
[19] ResNeXt model: https://pytorch.org/hub/pytorch_vision_resnext/ (retrieved April 6, 2020).
[20] Software engineering COCOMO model: https://www.geeksforgeeks.org/ (retrieved April 15, 2020).