Deep fake videos pose a significant threat to security and fuel misinformation, leveraging advanced neural networks to manipulate visual content convincingly. This paper presents a deep fake video detection system built on a CNN-RNN architecture, combining InceptionV3 for feature extraction and GRU layers for sequence modeling. The model was trained and evaluated on a dataset with an imbalance of FAKE and REAL videos, achieving a validation accuracy of 81.25%. The implementation includes dropout layers and early stopping to prevent overfitting, with an Adam optimizer ensuring efficient convergence. Comparative analysis with existing unsupervised methods, including PRNU- and noiseprint-based clustering, shows competitive accuracy. This study demonstrates the effectiveness of CNN-RNN architectures in detecting deep fake videos while highlighting the potential for future improvements using Transformer-based models and advanced attention mechanisms. The proposed approach provides a robust foundation for strengthening security measures against evolving deep fake technologies.
Introduction
Key points of the proposed system:
Architecture: Combines InceptionV3 for extracting high-level spatial features from individual video frames and stacked GRU layers to model temporal dependencies across frames, capturing subtle artifacts in video sequences.
Supervised Learning: Trained on labeled datasets (DeepFake Detection Challenge), achieving a validation accuracy of 81.25%, and showing better generalization than unsupervised clustering methods.
Advantages over Existing Methods: Unlike unsupervised PRNU/noiseprint approaches, the CNN-RNN method captures both spatial and temporal inconsistencies, reduces computational complexity, and is scalable for real-time applications.
Dataset Handling: Uses stratified sampling and preprocessing to manage imbalanced data (85% FAKE, 15% REAL), and proposes augmentation and noise-aware training to improve robustness against low-quality or compressed videos.
Future Improvements: Incorporating Transformers, attention mechanisms, super-resolution preprocessing, and more diverse datasets to enhance detection performance and generalization.
Summary: The CNN-RNN model provides an effective and scalable solution for deep fake video detection by combining spatial and temporal feature learning, outperforming traditional unsupervised methods while addressing real-world challenges like dataset imbalance and low-quality videos.
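The pipeline summarized above can be sketched in Keras. The frame count, feature dimension, GRU widths, and dropout rate below are illustrative assumptions, not values reported in this paper; a frozen InceptionV3 backbone produces one pooled feature vector per frame, and stacked GRUs classify the resulting sequence as FAKE or REAL.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_FRAMES = 20     # assumed number of frames sampled per video
FEATURE_DIM = 2048  # InceptionV3 global-average-pooled feature size

def build_feature_extractor():
    """Frozen InceptionV3 backbone: one 2048-d vector per frame."""
    base = keras.applications.InceptionV3(
        include_top=False, pooling="avg", input_shape=(299, 299, 3)
    )
    base.trainable = False  # only the recurrent head is trained
    return base

def build_sequence_model():
    """Stacked GRUs over per-frame features, ending in a FAKE/REAL score."""
    inputs = keras.Input(shape=(NUM_FRAMES, FEATURE_DIM))
    x = layers.GRU(32, return_sequences=True)(inputs)
    x = layers.GRU(16)(x)
    x = layers.Dropout(0.4)(x)  # regularization against overfitting
    x = layers.Dense(8, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the backbone keeps training computationally light, consistent with the scalability claim: only the small recurrent head is updated per batch.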
Conclusion
The proposed CNN-RNN architecture effectively addresses the challenges of deep fake video detection by leveraging InceptionV3 for spatial feature extraction and stacked GRU layers for temporal sequence modeling. By capturing both spatial and temporal features, the model demonstrates improved accuracy and generalization compared to existing unsupervised methods. The model achieved a validation accuracy of 81.25% while maintaining balanced precision and recall, showcasing its robustness in detecting manipulated videos. The use of dropout layers, batch normalization, and early stopping contributed to effective regularization and prevented overfitting, ensuring consistent performance across training and validation sets.
This approach highlights the advantages of supervised learning for deep fake detection, particularly in scenarios where labeled datasets are available. The proposed architecture provides a scalable and computationally efficient solution while maintaining competitive accuracy against state-of-the-art unsupervised methods. Future work includes exploring advanced attention mechanisms and Transformer-based models to enhance temporal sequence modeling further.
Additionally, expanding the dataset and experimenting with data augmentation techniques will enhance the model's robustness against emerging deep fake manipulation techniques. This research contributes to advancing deep fake detection technology, promoting digital security and trustworthiness in multimedia content. This work demonstrates that hybrid CNN–RNN models provide a practical and scalable defense against rapidly evolving deep fake generation techniques.
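One simple way to realize the frame-level augmentation mentioned above is random flipping plus additive noise, which mimics the degradation seen in low-quality or recompressed videos. The flip probability and noise level here are arbitrary illustrative choices, not tuned values.

```python
import numpy as np

def augment_frame(frame, rng):
    """Randomly flip a uint8 frame and add Gaussian noise to simulate
    compression artifacts and sensor noise in low-quality footage."""
    if rng.random() < 0.5:
        frame = frame[:, ::-1, :]  # horizontal flip
    noisy = frame.astype(np.float32) + rng.normal(0.0, 5.0, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Applying such perturbations during training exposes the model to the noise statistics of real-world videos, which is the stated goal of noise-aware training.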
References
[1] Valsesia, D., Coluccia, G., Bianchi, T., & Magli, E. (2015). Large-scale image retrieval using compressed camera identification. IEEE Transactions on Multimedia, 17(9), 1439–1449.
[2] Qiao, T., et al. (2019). Statistical model-based detector with texture weight map: Application in resampling authentication. IEEE Transactions on Multimedia, 21(5), 1077–1092.
[3] Chen, B., Tan, W., Coatrieux, G., Zheng, Y., & Shi, Y. Q. (2021). Serial image copy-move forgery localization with source/target distinction. IEEE Transactions on Multimedia, 23, 3506–3517.
[4] Peng, F., Yin, L.-P., Zhang, L.-B., & Long, M. (2020). CGR-GAN: Facial image regeneration for antiforensics using generative adversarial networks. IEEE Transactions on Multimedia, 22(10), 2511–2525.
[5] Zhao, Y., Zheng, N., Qiao, T., & Xu, M. (2019). Source camera identification using low-dimensional PRNU features. Multimedia Tools and Applications, 78(7), 8247–8269.
[6] Yao, H., Xu, M., Qiao, T., Wu, Y., & Zheng, N. (2020). Image forgery detection and localization using reliability fusion maps. Sensors, 20(22), 6668.
[7] Du, Y., Qiao, T., Xu, M., & Zheng, N. (2021). Face presentation attack detection with residual color texture representation. Security and Communication Networks, 2021, 1–16.
[8] Amerini, I., et al. (2017). Video source identification in social networks. Signal Processing: Image Communication, 57, 1–7.
[9] Singh, R. D., & Aggarwal, N. (2018). Comprehensive survey of video content authentication techniques. Multimedia Systems, 24(2), 211–240.
[10] Mandelli, S., Bestagini, P., Verdoliva, L., & Tubaro, S. (2020). Device attribution for stabilized video sequences. IEEE Transactions on Information Forensics and Security, 15, 14–27.
[11] Nguyen, T. T., et al. (2019). Deep learning methods for deepfake creation and detection. arXiv preprint arXiv:1909.11573.
[12] Rossler, A., et al. (2019). FaceForensics++: Learning to detect manipulated facial images. Proceedings of the IEEE International Conference on Computer Vision, 1–11.
[13] Korshunov, P., & Marcel, S. (2018). Deepfakes: A new threat to facial recognition? Assessment and detection. arXiv preprint arXiv:1812.08685.
[14] Khodabakhsh, A., Ramachandra, R., Raja, K., Wasnik, P., & Busch, C. (2018). Generalization of fake face detection methods. Proceedings of the IEEE International Conference on Biometrics Special Interest Group, 1–6.
[15] Li, Y., & Lyu, S. (2019). Exposing deepfake videos by detecting face warping artifacts. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 46–52.
[16] Li, Y., Chang, M.-C., & Lyu, S. (2018). Exposing AI-generated fake videos by detecting eye blinking. Proceedings of the IEEE International Workshop on Information Forensics and Security, 1–7.
[17] Yang, X., Li, Y., & Lyu, S. (2019). Exposing deep fakes using inconsistent head poses. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 8261–8265.
[18] Fernandes, S., et al. (2019). Predicting heart rate variations in deepfake videos using neural ODE. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 1721–1729.