Realistic video generation from audio input is a challenging, emerging problem at the intersection of natural language processing, computer vision, and generative modeling. The ability to automatically generate coherent and visually compelling video content from raw audio has promising applications in media creation, virtual education, assistive technologies, and entertainment. Manual video creation remains time-consuming and skill-intensive, while automated solutions often lack semantic alignment and visual realism.
To address this gap, this project proposes an end-to-end intelligent pipeline that synthesizes realistic video content from audio input using Generative Adversarial Networks (GANs). The system begins by transcribing the user's audio with OpenAI's Whisper ASR model, then extracts a meaningful textual description via a language model (e.g., Groq LLaMA or OpenAI GPT). The resulting script is used to generate key search terms for visual content retrieval, sourcing high-quality imagery from the Pexels API. Narration is synthesized with Edge TTS, and synchronized subtitles are created. The images are compiled into a dynamic video using MoviePy, and visual quality is further enhanced with Real-ESRGAN super-resolution. The final output is a short, high-resolution, contextually accurate video with natural narration and relevant imagery. This work demonstrates the effectiveness of combining audio processing, NLP, GAN-based enhancement, and open content APIs to automate realistic video generation from scratch.
Introduction
This project addresses the growing need for fast, automated, and high-quality video generation in an era dominated by short-form digital content. Traditional video creation involves multiple tools and expertise, limiting accessibility for casual users. The proposed system simplifies this by automating the process of turning audio input into fully composed, high-resolution videos.
Objectives
Automate video creation from audio to reduce manual effort.
Leverage advanced AI tools like Whisper (speech-to-text), Groq (script generation), Edge TTS (text-to-speech), and Real-ESRGAN (visual enhancement).
Synchronize visuals with narration for an immersive experience.
Make multimedia content creation accessible to educators and content creators.
System Architecture & Methodology
The pipeline includes the following stages; minimal code sketches for each stage follow this list.
Audio Input: The user provides a .wav file.
Speech Recognition (Whisper): Transcribes the audio with timestamps.
Script Generation (Groq API): Refines the transcript into a narrative and extracts key search terms for visuals.
Image Retrieval (Pexels API): Sources high-quality imagery matching the search terms.
Speech Synthesis (Edge TTS): Generates natural narration and synchronized subtitles.
Video Composition (MoviePy): Combines the narration, images, and subtitles into a single video.
Visual Enhancement (Real-ESRGAN): Upscales the visuals for clarity.
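A minimal sketch of the speech-recognition stage, assuming the open-source whisper Python package; the model size and input file name are illustrative:

    import whisper

    # Load a pretrained Whisper model; smaller models trade accuracy for speed.
    model = whisper.load_model("base")

    # Transcribe the input audio; the result holds the full text plus timestamped segments.
    result = model.transcribe("input.wav")
    transcript = result["text"]
    for seg in result["segments"]:
        print(f'{seg["start"]:.2f}s - {seg["end"]:.2f}s  {seg["text"]}')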
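A sketch of the script-generation stage, assuming the groq Python SDK, which follows an OpenAI-style chat-completions interface; the model id and prompts are placeholders:

    from groq import Groq

    client = Groq(api_key="YOUR_GROQ_API_KEY")

    def generate_script_and_keywords(transcript):
        # Ask the LLM to rewrite the transcript as a narration script and to
        # list a few search keywords for image retrieval.
        response = client.chat.completions.create(
            model="llama3-70b-8192",  # placeholder model id
            messages=[
                {"role": "system",
                 "content": "Rewrite the transcript as a short narration script, "
                            "then list five visual search keywords, one per line."},
                {"role": "user", "content": transcript},
            ],
        )
        return response.choices[0].message.content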
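A sketch of the image-retrieval stage against the public Pexels search endpoint; the API key and query are placeholders, and the response fields follow the documented Pexels REST API:

    import requests

    PEXELS_KEY = "YOUR_PEXELS_API_KEY"

    def fetch_image_urls(query, count=5):
        # Query the Pexels photo-search endpoint for imagery matching a keyword.
        resp = requests.get(
            "https://api.pexels.com/v1/search",
            headers={"Authorization": PEXELS_KEY},
            params={"query": query, "per_page": count},
            timeout=30,
        )
        resp.raise_for_status()
        return [photo["src"]["large"] for photo in resp.json()["photos"]]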
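A sketch of the narration stage, assuming the edge-tts package and its asynchronous API; the voice name and output path are illustrative:

    import asyncio
    import edge_tts

    async def synthesize(text, out_path="narration.mp3"):
        # Convert the generated script into spoken narration.
        communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
        await communicate.save(out_path)

    asyncio.run(synthesize("Generated narration script goes here."))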
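A sketch of the composition stage using the MoviePy 1.x API; file names, durations, and the subtitle text are illustrative, and TextClip additionally requires ImageMagick:

    from moviepy.editor import (AudioFileClip, CompositeVideoClip, ImageClip,
                                TextClip, concatenate_videoclips)

    audio = AudioFileClip("narration.mp3")
    image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
    per_image = audio.duration / len(image_paths)

    # Give each retrieved image an equal share of the narration's duration.
    clips = [ImageClip(p).set_duration(per_image).resize(height=720)
             for p in image_paths]
    video = concatenate_videoclips(clips, method="compose").set_audio(audio)

    # Overlay a simple subtitle; real subtitles would be timed per Whisper segment.
    subtitle = (TextClip("Example subtitle", fontsize=40, color="white")
                .set_duration(video.duration)
                .set_position(("center", "bottom")))
    CompositeVideoClip([video, subtitle]).write_videofile("output.mp4", fps=24)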
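A sketch of the enhancement stage, invoking the inference script from the official Real-ESRGAN repository as a subprocess; the script path, model name, and folder names are assumptions based on that repository's documented usage:

    import subprocess

    # Upscale every source image (or extracted frame) 4x with a pretrained model.
    subprocess.run(
        ["python", "inference_realesrgan.py",
         "-n", "RealESRGAN_x4plus",    # pretrained 4x general-purpose model
         "-i", "frames/",              # low-resolution inputs
         "-o", "frames_upscaled/"],    # enhanced outputs
        check=True,
    )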
Supporting Research
TA2V: Validates audio-text-video synthesis via multimodal AI.
IRE & SRGAN: Support high-quality image and video upscaling.
Audio PUGAN: Inspires audio synthesis with GAN-based techniques.
Implementation & Results
The system was tested on multiple audio samples, producing synchronized, visually appealing videos with minimal user input. Real-ESRGAN significantly improved video resolution. The integrated pipeline performed well in terms of quality, timing, and contextual accuracy.
Limitations
Relies on third-party APIs (e.g., Pexels).
Sensitive to noisy audio inputs.
Processing time may be high for high-resolution outputs.
Future Scope
Plans include:
Multilingual support.
Avatar-based video features.
Real-time previews and enhanced customization.
Conclusion
This project presents an automated end-to-end system for generating realistic and contextually accurate short videos from audio input. By integrating state-of-the-art tools such as Whisper for speech recognition, Groq/OpenAI for script generation, and Real-ESRGAN for visual enhancement, the system bridges voice-based input with coherent video synthesis. Each module—transcription, language modeling, image retrieval, TTS, and video assembly—contributes to a streamlined workflow requiring minimal user intervention. Designed for accessibility, the system simplifies content creation for users without technical expertise and is adaptable for applications in education, media, and infotainment. The results validate the feasibility of audio-to-video generation and highlight future potential in personalization, emotion-aware narration, and multilingual support.
References
[1] A. Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision," arXiv preprint arXiv:2212.04356, 2022. [Online]. Available: https://arxiv.org/abs/2212.04356
[2] C. Ledig et al., "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4681–4690.
[3] X. Wang et al., "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data," in Proc. IEEE/CVF Int. Conf. on Computer Vision (ICCV) Workshops, 2021.
[4] Y. Wu et al., "TA2V: Generating aligned video from audio and text using diffusion models," IEEE Transactions on Multimedia, 2024.
[5] Groq Inc., "Groq API for fast inference using LLMs (LLaMA/GPT)," 2024. [Online]. Available: https://groq.com
[6] MoviePy Developers, "MoviePy: A Python library for editing video programmatically," 2023. [Online]. Available: https://zulko.github.io/moviepy/
[7] OpenAI, "Whisper: Open-source speech-to-text system," 2023. [Online]. Available: https://github.com/openai/whisper
[8] Pexels, "Pexels API: Access to royalty-free images and videos," 2023. [Online]. Available: https://www.pexels.com/api/