Deepfake audio has emerged as one of the most concerning challenges in digital media authenticity, enabled by advances in deep learning and generative modeling. It has both positive applications in assistive technologies and dangerous implications for misinformation, impersonation, and fraud. Traditional supervised classification approaches often fail to generalize to new synthesis techniques. Anomaly detection methods, particularly those leveraging Generative Adversarial Networks (GANs), have shown promise in identifying deepfake audio by modeling the distribution of authentic speech. This paper presents a comprehensive survey of anomaly detection techniques applied to deepfake audio, with a focus on GAN-based frameworks such as GANomaly and f-AnoGAN, and a comparison with CNN- and autoencoder-based methods.
Introduction
Deepfake audio technology can generate highly realistic human speech, offering benefits for applications like virtual assistants and accessibility, but it also poses serious risks such as identity theft, voice fraud, impersonation, and misinformation. Detecting manipulated audio is challenging, especially because supervised methods depend on labeled datasets that quickly become outdated as synthesis techniques evolve. As a result, anomaly detection—particularly GAN-based approaches—has emerged as a promising solution by learning the characteristics of genuine speech and identifying deviations as potential deepfakes.
The literature highlights a wide range of detection methods. GAN-based models such as GANomaly and f-AnoGAN are effective at detecting unseen attacks without labeled fake data, though they face issues with robustness, training stability, and computational cost. CNN-based classifiers perform well on known deepfake techniques but struggle to generalize. Autoencoders, VAEs, transformers, self-supervised learning, contrastive learning, graph-based methods, and multimodal approaches each offer strengths in representation learning and robustness, but often at the cost of complexity, scalability, or real-time performance. Lightweight and real-time models aim to address deployment constraints, while challenges such as adversarial robustness, explainability, cross-lingual generalization, and overfitting to specific datasets remain active research areas.
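To make the GANomaly-style scoring concrete, the following is a minimal sketch in PyTorch of an encoder-decoder-encoder model over flattened log-mel spectrograms. The module names, layer sizes, and input dimensions are illustrative assumptions, not taken from any of the surveyed implementations.

```python
# Hypothetical GANomaly-style anomaly scorer (illustrative sketch, not a
# reference implementation of the published GANomaly architecture).
import torch
import torch.nn as nn

class SpecEncoder(nn.Module):
    def __init__(self, in_dim=64 * 128, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class SpecDecoder(nn.Module):
    def __init__(self, latent_dim=100, out_dim=64 * 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, out_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class GanomalyStyleScorer(nn.Module):
    """Encoder-decoder-encoder: the anomaly score is the distance between the
    latent code of the input and the latent code of its reconstruction."""
    def __init__(self, in_dim=64 * 128, latent_dim=100):
        super().__init__()
        self.enc1 = SpecEncoder(in_dim, latent_dim)
        self.dec = SpecDecoder(latent_dim, in_dim)
        self.enc2 = SpecEncoder(in_dim, latent_dim)

    def anomaly_score(self, x):
        z = self.enc1(x)          # latent code of the input spectrogram
        x_hat = self.dec(z)       # reconstruction from that code
        z_hat = self.enc2(x_hat)  # latent code of the reconstruction
        # Genuine speech, seen during training, reconstructs well, so z stays
        # close to z_hat; synthetic audio tends to yield a larger discrepancy.
        return torch.mean((z - z_hat) ** 2, dim=1)
```

In practice such a model would be trained on genuine audio only, with adversarial, reconstruction, and latent-consistency losses, and a decision threshold on the score would be calibrated on a held-out set of genuine utterances.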
The proposed methodology focuses on GAN-based anomaly detection, involving preprocessing, feature extraction (e.g., MFCCs or log-mel spectrograms), GAN training on genuine audio only, and anomaly scoring based on reconstruction error or latent-space discrepancies. Datasets like ASVspoof and WaveFake are commonly used for evaluation, including cross-dataset testing to assess generalization.
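As a sketch of the preprocessing and feature-extraction step described above, the snippet below uses librosa to compute log-mel spectrograms (an MFCC front end would be an analogous call) and applies a simple thresholded anomaly decision. The sample rate, frame settings, and threshold strategy are assumptions for illustration, not parameters prescribed by the surveyed papers.

```python
# Illustrative preprocessing and decision step for a GAN-based anomaly detector.
import numpy as np
import librosa

def extract_log_mel(path, sr=16000, n_mels=64, duration=4.0):
    """Load an utterance, pad/truncate to a fixed duration, return a
    per-utterance normalized log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    y = librosa.util.fix_length(y, size=int(sr * duration))
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Normalize so the GAN sees inputs on a comparable scale across utterances.
    log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    return log_mel.astype(np.float32)

def anomaly_decision(score, threshold):
    """Flag an utterance as a suspected deepfake when its anomaly score exceeds
    a threshold calibrated on genuine speech (e.g., a high percentile of
    genuine-speech scores from a validation split)."""
    return score > threshold
```

Cross-dataset evaluation (e.g., training the detector's threshold on ASVspoof and testing on WaveFake) then probes how well such a pipeline generalizes beyond the synthesis methods seen during development.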
Conclusion
Deepfake audio detection remains a critical challenge as synthetic speech technologies advance rapidly. GAN-based anomaly detection frameworks offer a promising solution by learning the distribution of genuine speech and identifying manipulated audio as anomalies, without the need for extensive labeled fake data. Frameworks such as GANomaly and f-AnoGAN, built around a generator and a discriminator, demonstrate strong potential for adapting to unseen attacks and improving detection robustness. However, challenges such as training instability, degraded reconstruction quality, and limited cross-dataset generalization persist. Addressing these issues through improved architectures, more diverse datasets, and hybrid learning strategies will be essential for building scalable and reliable detection systems. Future work should focus on enhancing model interpretability, reducing computational overhead, and ensuring robustness against sophisticated adversarial attacks. By advancing GAN-based anomaly detection, researchers can contribute significantly to securing digital audio media against the growing threat of deepfake manipulation.