Abstract
Video data plays an increasingly important role as digital evidence in areas such as surveillance systems, media verification, and forensic analysis. However, the widespread availability of advanced video editing software has made it possible to perform complex temporal manipulations, including frame deletion, which can conceal significant events and undermine the reliability of video evidence. Most existing frame deletion detection techniques are based on convolutional neural networks (CNNs) and handcrafted temporal descriptors. These approaches often exhibit limitations in modelling long-range temporal relationships and show reduced performance in low-motion scenes or under varying illumination conditions. To address these challenges, this paper proposes a Transformer-based temporal learning framework for reliable frame deletion detection in videos. The proposed method employs self-attention mechanisms to capture global temporal dependencies across video frames, allowing the identification of subtle temporal inconsistencies introduced by frame deletion. In contrast to conventional CNN-based techniques, the framework reduces dependence on frame differencing operations and manually defined statistical thresholds. By combining frame-level feature extraction with a temporal Transformer encoder, the proposed model enhances robustness across diverse motion patterns. This study demonstrates the potential of attention-driven temporal modelling in video forensics and establishes a scalable basis for future research in deep learning–based video manipulation detection.
Introduction
This paper addresses the growing challenge of detecting video frame deletion, a form of temporal manipulation used to conceal or alter events without introducing visible spatial artifacts. Traditional handcrafted and motion-based forensic methods are often unreliable in low-motion scenes, sensitive to noise, and dependent on manually tuned thresholds. Although CNN-based deep learning approaches improve detection by learning spatio-temporal features, they are limited by fixed-length temporal windows and weak modelling of long-range temporal dependencies.
To overcome these limitations, this paper proposes a Transformer-based temporal learning framework for frame deletion detection. The framework combines lightweight CNN-based frame-level feature extraction with positional encoding and a temporal Transformer encoder. Self-attention mechanisms enable global temporal modelling across entire video sequences, allowing the system to detect subtle temporal inconsistencies even in static or visually consistent scenes.
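For concreteness, the global temporal modelling described above rests on scaled dot-product self-attention. The text does not specify an attention variant, so the canonical formulation of Vaswani et al. is shown here, with $Q$, $K$, and $V$ denoting linear projections of the frame embeddings and $d_k$ the key dimension:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

Because the softmax weights couple every frame to every other frame in a single operation, a deletion-induced discontinuity can influence the representation regardless of how far apart the affected frames lie, which is exactly the long-range behaviour that fixed-window CNNs struggle to capture.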
The proposed pipeline includes frame extraction, preprocessing, CNN-based feature embedding, positional encoding, Transformer-based temporal modelling, and final classification. Unlike existing hybrid or CNN-based methods, the approach is fully data-driven and does not rely on handcrafted features or heuristic thresholds, improving robustness and adaptability across diverse video conditions.
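To make the pipeline concrete, the sketch below instantiates each stage in PyTorch. It is a minimal illustration under assumed settings: the module names (FrameDeletionDetector, SinusoidalPositionalEncoding), embedding width, head count, layer count, and clip resolution are all hypothetical choices, not the configuration described in this work.

# A minimal PyTorch sketch of the proposed pipeline. All names and
# hyperparameters below are illustrative assumptions, not the authors'
# exact configuration.
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to frame embeddings."""
    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, d_model)
        return x + self.pe[: x.size(1)]

class FrameDeletionDetector(nn.Module):
    """CNN frame embedder -> positional encoding -> temporal Transformer."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        # Lightweight per-frame feature extractor; a pretrained backbone
        # could be substituted here.
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, 2)  # pristine vs. frame-deleted

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.frame_cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        feats = self.temporal_encoder(self.pos_enc(feats))
        return self.classifier(feats.mean(dim=1))  # sequence-level logits

# Usage: a batch of two 32-frame clips at 112x112 resolution.
logits = FrameDeletionDetector()(torch.randn(2, 32, 3, 112, 112))

Mean-pooling over time yields a single sequence-level decision; emitting per-frame logits from each time step instead would support the frame-level localization identified as future work.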
Conceptual analysis indicates that the framework offers better generalization, reduced dependence on scene motion, and improved reliability compared with existing methods. While large-scale experimental validation is left for future work, the study highlights Transformer-based temporal attention as a promising and scalable direction for advancing video frame deletion detection in real-world forensic applications.
Conclusion
This paper presented a Transformer-based temporal learning framework for reliable frame deletion detection in videos. By addressing the limitations of existing CNN-based and hybrid approaches, the proposed framework leverages self-attention mechanisms to capture global temporal dependencies across video sequences. This enables the detection of subtle temporal inconsistencies introduced by frame deletion, particularly in challenging scenarios involving low motion, static backgrounds, or visually consistent content.
The proposed framework integrates CNN-based frame-level feature extraction with temporal positional encoding and a Transformer encoder to model long-range temporal relationships in a fully learnable and data-driven manner. By reducing dependence on handcrafted temporal descriptors, frame differencing operations, and heuristic thresholds, the framework improves robustness and adaptability across diverse video conditions. The modular design also allows flexibility in selecting network components based on computational requirements and application constraints.
Although the present work focuses on conceptual framework design rather than extensive experimental evaluation, the proposed approach establishes a strong foundation for future research in video forensics. Future work will involve quantitative performance evaluation on benchmark datasets, frame-level localization of deletion points, and optimization for real-time deployment. Overall, this study highlights the potential of attention-driven temporal modelling as a scalable and effective direction for deep learning–based video manipulation detection.