The proliferation of synthetic audio generated by advanced generative models poses a significant threat to the integrity of digital communication systems. This study proposes a novel hybrid framework combining Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM) networks, and eXtreme Gradient Boosting (XGBoost) to detect audio DeepFakes effectively. CNNs extract spatial features from Mel-frequency cepstral coefficients (MFCCs), Bi-LSTMs capture temporal dependencies, and XGBoost serves as a final decision-level classifier. Experiments conducted on benchmark datasets demonstrate that the proposed system achieves an accuracy of 98%, along with high precision, recall, and robustness against unseen attacks. These results highlight that combining deep spatial–temporal feature learning with ensemble classification offers a strong and reliable solution for securing voice-based systems against DeepFake threats.
I. Introduction
Advancements in AI-driven speech synthesis have enabled the creation of highly realistic fake voices (DeepFakes), which pose serious risks such as:
Impersonation and fraud (e.g., in banking, authentication)
Manipulated audio evidence
Breakdown of trust in digital communications
Traditional voice verification systems struggle to detect these fakes due to their realism.
Goal: Develop a robust DeepFake detection system using a hybrid CNN, Bi-LSTM, and XGBoost architecture that captures the subtle differences between real and synthetic voices.
II. Related Work
Traditional methods: Relied on handcrafted spectral features (e.g., MFCC, CQCC) combined with classical models such as Gaussian mixture models (GMMs); a baseline sketch follows this list.
Deep learning: CNNs, LSTMs, and Bi-LSTMs showed better results on spectrograms and sequential speech features.
Ensemble methods: XGBoost and other boosted trees improve robustness and cross-dataset generalization.
Recent improvements:
Attention mechanisms
Transformer models for global context
Lightweight CNN-attention hybrids for edge devices
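As a concrete reference point, the following is a minimal sketch of the traditional MFCC + GMM pipeline mentioned above, assuming librosa and scikit-learn; component counts, the sample rate, and the random frame pools are illustrative placeholders, not settings from any cited work.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=13, sr=16000):
    """Load an utterance and return its frame-level MFCCs (frames x n_mfcc)."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Placeholder frame pools; in practice, stack mfcc_frames(p) over training files.
real_frames = np.random.randn(2000, 13)
fake_frames = np.random.randn(2000, 13) + 0.5

# One GMM per class, fitted on pooled frame-level features
gmm_real = GaussianMixture(n_components=16, covariance_type="diag",
                           random_state=0).fit(real_frames)
gmm_fake = GaussianMixture(n_components=16, covariance_type="diag",
                           random_state=0).fit(fake_frames)

def llr(frames):
    """Average log-likelihood ratio; positive favors 'real', negative 'fake'."""
    return gmm_real.score(frames) - gmm_fake.score(frames)

print(llr(np.random.randn(200, 13)))  # score an unseen utterance's frames
```

Deep learning approaches replaced the handcrafted scoring rule above with learned representations, which motivates the hybrid design in the next section.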
III. Proposed Methodology
A hybrid model integrating:
CNN for spatial & spectral feature extraction
Bi-LSTM for temporal dependency modeling in both directions
XGBoost for ensemble-based, decision-level refinement of the final predictions
The hybrid model outperforms the traditional classifiers and standalone deep models evaluated, and handles diverse DeepFake generation methods; a minimal sketch of the pipeline follows.
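To make the pipeline concrete, below is a minimal sketch of the CNN and Bi-LSTM feature extractor with an XGBoost decision head, using Keras and xgboost. All shapes, layer sizes, and hyperparameters are illustrative assumptions rather than the paper's exact configuration, and the random arrays stand in for real MFCC inputs.

```python
import numpy as np
from tensorflow.keras import layers, models
from xgboost import XGBClassifier

N_FRAMES, N_MFCC = 200, 13  # assumed per-utterance MFCC matrix size

# CNN stage (local spectral-temporal patterns) feeding a Bi-LSTM stage
inp = layers.Input(shape=(N_FRAMES, N_MFCC))
x = layers.Conv1D(64, 5, padding="same", activation="relu")(inp)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Bidirectional(layers.LSTM(64))(x)  # temporal context, both directions
feats = layers.Dense(64, activation="relu", name="deep_features")(x)
out = layers.Dense(1, activation="sigmoid")(feats)  # temporary training head
model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data standing in for real MFCC matrices and labels (1 = fake)
X = np.random.randn(64, N_FRAMES, N_MFCC).astype("float32")
y = np.random.randint(0, 2, size=64)
model.fit(X, y, epochs=2, batch_size=16, verbose=0)

# Decision level: XGBoost on the learned 64-d embeddings replaces the sigmoid head
embedder = models.Model(model.input, model.get_layer("deep_features").output)
Z = embedder.predict(X, verbose=0)
clf = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
clf.fit(Z, y)
print(clf.predict(Z[:5]))  # 0 = real, 1 = fake
```

Training the network first with a temporary sigmoid head, then refitting XGBoost on the penultimate embeddings, is one straightforward way to realize the decision-level ensemble described above.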
Key Takeaways from Model Evolution:
Transition from single-architecture learning (CNN or RNN alone) to hybrid multi-stage learning
Demonstrates that no single architecture is sufficient for robust detection
Best results come from combining spatial, temporal, and ensemble-based learning
IV. Conclusion
This study presented a hybrid architecture combining Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and XGBoost for detecting DeepFake voice samples. By extracting Mel-Frequency Cepstral Coefficients (MFCC) from speech signals, the CNN was employed to capture spatial features, while the Bi-LSTM model learned temporal dependencies in the audio data. Finally, XGBoost served as a robust classifier, leveraging high-level deep features for the final prediction. The proposed method achieved promising results in terms of accuracy, precision, recall, and F1-score, outperforming traditional classifiers and standalone deep learning models. This confirms the effectiveness of integrating deep spatial–temporal feature learning with powerful ensemble-based classification techniques for audio DeepFake detection.
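For completeness, the reported metrics can be computed as follows; this is a minimal scikit-learn sketch with placeholder label arrays, not the study's evaluation code.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder labels and predictions; 1 = fake, 0 = real.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 0, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```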
In the future, this work can be extended by incorporating larger and more diverse multilingual datasets to improve robustness across different accents and languages. Exploring advanced feature representations such as spectrogram-based embeddings or self-supervised audio representations could further enhance detection performance. Additionally, integrating adversarial defense mechanisms can strengthen resilience against increasingly sophisticated DeepFake generation techniques. Finally, deploying the system in real-time applications, such as online meeting platforms or voice authentication systems, can make it highly valuable for practical and security-critical scenarios.