Deepfake audio has emerged as a significant cybersecurity and social threat due to rapid advancements in artificial intelligence and speech synthesis technologies. Fraudulent audio generated using deep learning techniques can imitate human voices with high accuracy, making the identification of fake speech increasingly difficult, especially in noisy real-world environments. To address this challenge, this work proposes an AI-based deepfake audio detection framework with noise-aware spectrogram processing that improves detection reliability under varying acoustic conditions.
The proposed model utilizes advanced audio preprocessing techniques to remove background noise and convert speech signals into spectrogram representations, enabling efficient extraction of temporal and frequency-domain features. A deep learning framework based on Convolutional Neural Networks (CNN) is employed to automatically learn discriminative patterns between genuine and manipulated audio samples. Noise-aware feature enhancement is integrated into the preprocessing stage to improve robustness against environmental disturbances and signal degradation.
The system is trained and evaluated on benchmark deepfake audio datasets containing both authentic and AI-generated speech samples with multiple noise levels. Experimental results demonstrate improved classification accuracy, reduced false detection rates, and enhanced generalization performance compared to conventional deepfake detection methods. The proposed framework can be effectively applied in voice authentication, digital forensics, cybersecurity, media verification, and fraud prevention systems.
Introduction
This project focuses on developing a Noise-Aware Deepfake Audio Detection System to identify synthetic or AI-generated speech in real-world environments. With advances in voice cloning, text-to-speech (TTS), and voice conversion technologies, deepfake audio has become increasingly realistic, creating risks such as identity fraud, fake emergency calls, political misinformation, and bypassing voice-based authentication systems. Existing detection methods often perform well in controlled environments but struggle with background noise, channel distortions, and large-scale deployment requirements.
The proposed system addresses these limitations by combining noise-aware training, spectrogram-based feature extraction, and a hybrid CNN-BiLSTM deep learning architecture. The model uses Mel-Frequency Cepstral Coefficients (MFCCs) and Mel-spectrograms to capture both spectral and temporal characteristics of speech. Noise augmentation is performed using samples from the MUSAN and UrbanSound8K datasets, enabling the model to remain robust under real-world acoustic conditions. Additionally, batch-based parallel inference allows simultaneous processing of multiple audio files, making the system suitable for large-scale applications.
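The noise augmentation step can be illustrated with a minimal sketch that mixes a noise clip into an utterance at a chosen signal-to-noise ratio. This is not the project's actual pipeline: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, the signals are plain Python lists, and a synthetic tone and Gaussian noise stand in for ASVspoof speech and MUSAN or UrbanSound8K clips.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so that it covers the whole utterance.
    repeats = len(speech) // len(noise) + 1
    noise = (noise * repeats)[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return list(speech)  # silent noise clip: nothing to add
    # Choose gain g so that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(clean, mixed):
    """SNR of a mixture, treating (mixed - clean) as the noise component."""
    p_clean = sum(c * c for c in clean) / len(clean)
    p_noise = sum((m - c) ** 2 for c, m in zip(clean, mixed)) / len(clean)
    return 10.0 * math.log10(p_clean / p_noise)

# Synthetic stand-ins (assumptions): a 440 Hz tone sampled at 16 kHz plays
# the role of speech, and seeded Gaussian noise plays the role of a noise clip.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]

for snr_db in (0, 10, 20):
    mixed = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, mixed), 3))
```

Scaling the noise rather than the speech keeps the level of the utterance unchanged across SNR conditions, which makes clean and noisy versions of the same sample directly comparable.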
The system is trained and evaluated using the ASVspoof 2019 Logical Access (LA) dataset, a widely used benchmark containing genuine and spoofed speech generated by multiple text-to-speech and voice conversion systems. Audio preprocessing includes validation, normalization, resampling, and noise reduction. Features extracted from MFCCs and Mel-spectrograms are processed through CNN layers for spatial feature learning and BiLSTM layers for temporal dependency modeling. The outputs are fused and passed through fully connected layers with a sigmoid classifier to determine whether an audio sample is real or fake.
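To make the Mel-based features more concrete, the sketch below shows the Hz-to-mel mapping that underlies both Mel-spectrograms and MFCCs, and how band centres spaced evenly on the mel scale spread out in Hz. It assumes the common HTK-style formula (some libraries default to a different variant), and the values of 40 bands and an 8 kHz upper limit are illustrative assumptions, not settings reported for this system.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

# Illustrative settings (assumptions): 40 bands between 0 Hz and 8 kHz.
centers = mel_band_centers(40, 0.0, 8000.0)
print([round(c, 1) for c in centers[:3]])   # narrow spacing at low frequencies
print([round(c, 1) for c in centers[-3:]])  # wide spacing at high frequencies
```

Because the bands are narrow at low frequencies and wide at high frequencies, the resulting spectrogram allocates more resolution to the region where most speech energy lies.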
A literature survey shows that previous studies employed CNNs, RNNs, LSTMs, transformer models, GANs, and ensemble learning methods for deepfake detection. While many achieved high accuracy in controlled settings, challenges related to noise robustness, multilingual data, scalability, and unseen attack types remain. The proposed approach seeks to overcome these challenges through noise-aware training and hybrid deep learning techniques.
The model employs ReLU, Leaky ReLU, Tanh, and Sigmoid activation functions to improve learning efficiency and classification performance. Training minimizes Binary Cross-Entropy Loss using the Adam optimizer, with L2 regularization and early stopping to prevent overfitting. A novel chaos-based hyperparameter optimization method that uses logistic maps efficiently identifies well-performing network configurations.
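The details of the chaos-based search are not given here, so the following is only one plausible reading: a logistic map generates a deterministic but irregular sequence in (0, 1), and each value is mapped onto a hyperparameter range to produce candidate configurations. The function names, the hyperparameters chosen, their ranges, the starting value `x0`, and the toy scoring function are all assumptions made for illustration.

```python
import math

def logistic_map(x0, n, r=4.0):
    """Return n successive values of x <- r * x * (1 - x), starting from x0."""
    values, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        values.append(x)
    return values

def chaotic_candidates(n_candidates, x0=0.37):
    """Map chaotic values onto hypothetical hyperparameter ranges."""
    # Note: at r = 4 the values are not uniformly distributed; they
    # concentrate near 0 and 1, so the range ends are sampled more often.
    seq = logistic_map(x0, 3 * n_candidates)
    candidates = []
    for i in range(n_candidates):
        a, b, c = seq[3 * i: 3 * i + 3]
        candidates.append({
            "learning_rate": 10 ** (-4 + 2 * a),  # between 1e-4 and 1e-2
            "dropout": 0.1 + 0.4 * b,             # between 0.1 and 0.5
            "lstm_units": 32 + int(c * 224),      # integer between 32 and 256
        })
    return candidates

def pick_best(candidates, score_fn):
    """Return the candidate with the highest score."""
    return max(candidates, key=score_fn)

# Toy score (assumption): prefers learning rates near 1e-3. In practice the
# score would be a validation metric obtained by training each candidate.
def toy_score(cfg):
    return -abs(math.log10(cfg["learning_rate"]) + 3.0)

best = pick_best(chaotic_candidates(8), toy_score)
print(best)
```

Unlike pseudo-random search, the sequence is fully determined by `x0` and `r`, so a search can be reproduced exactly from those two numbers.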
The proposed system achieves 98.28% accuracy, 97.82% precision, 98.87% recall (sensitivity), 97.77% specificity, 98.34% F1-score, and an AUROC of 0.9831. The error metrics MAE = 0.017 and RMSE = 0.131 are low, and R² = 0.968, indicating strong predictive reliability. These results show that the proposed noise-aware CNN-BiLSTM framework can effectively detect deepfake audio even in challenging acoustic environments while maintaining scalability and computational efficiency.
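For readers who want to relate these figures to one another, the sketch below computes the classification metrics from confusion-matrix counts and checks the reported F1-score against the reported precision and recall. The counts passed to `binary_metrics` are made-up illustrative values, not the project's confusion matrix, and the function names are hypothetical. The last line reflects an assumption: the reported MAE and RMSE appear consistent with errors computed on hard 0/1 predictions, where RMSE is the square root of the error rate.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (denominators must be non-zero)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only (assumption), not taken from the experiments.
example = binary_metrics(tp=90, fp=10, tn=80, fn=20)
print({name: round(value, 4) for name, value in example.items()})

# Consistency check on the reported figures: precision 97.82%, recall 98.87%.
reported_f1 = f1_from(0.9782, 0.9887)
print(round(reported_f1, 4))

# With hard 0/1 predictions, RMSE equals sqrt(error rate) = sqrt(1 - accuracy).
rmse_from_accuracy = math.sqrt(1 - 0.9828)
print(round(rmse_from_accuracy, 3))
```

The harmonic mean of the reported precision and recall rounds to the reported 98.34% F1-score, so those three figures agree with one another.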
Conclusion
This project presents a noise-aware deepfake audio detection system using a hybrid CNN-BiLSTM architecture to overcome limitations of existing methods in real-world environments. The system integrates spectrogram-based feature extraction with noise-aware data augmentation and a parallel batch inference mechanism, enabling efficient and scalable multi-audio processing.
The model was trained and evaluated on the ASVspoof 2019 Logical Access dataset, enhanced with noise from the MUSAN and UrbanSound8K datasets at different signal-to-noise ratios (0 dB, 10 dB, and 20 dB). Experimental results show that the proposed model achieves 98.3% accuracy and an AUROC of 0.983 under clean conditions, significantly outperforming the baseline CNN-BiLSTM. Even under severe noise (0 dB), the model maintains strong performance with 91.2% accuracy, demonstrating robustness compared to the baseline, which shows a larger performance drop.
In addition to accuracy, the model improves prediction reliability, achieving lower error rates and better consistency in probability outputs. It significantly reduces both false positives and false negatives, ensuring more reliable detection of deepfake audio while minimizing incorrect classifications. Compared to other machine learning models such as Logistic Regression, SVM, Random Forest, and XGBoost, the proposed system consistently delivers superior performance.