The emergence of generative artificial intelligence has redefined the boundaries of digital content creation, particularly in the domain of computational storytelling. This paper presents GenNarrate, a modular, multi-modal generative AI system engineered to synthesize coherent narratives augmented with corresponding visual and auditory elements. The architecture leverages advanced machine learning models, including LLaMA2 for text generation, DALL·E for image synthesis, and a combination of Google Text-to-Speech (GTTS) and AudioLDM for expressive audio narration and sound design. GenNarrate facilitates user-driven content generation by accepting configurable parameters, such as genre, tone, character elements, and desired multimedia outputs, through an interactive front-end interface. These inputs are orchestrated through a Flask-based backend pipeline, which integrates the constituent modules and produces downloadable outputs comprising narrated stories, image-enhanced documents, and synchronized audio tracks. The proposed system demonstrates a novel approach to narrative automation, emphasizing cross-modal coherence, scalability, and personalization. This study further situates GenNarrate within the broader context of AI-enhanced storytelling technologies, offering comparative insights against existing open-source models such as GPT-3 and Stable Diffusion. Potential applications are explored across educational content delivery, therapeutic interventions, creative industries, and interactive media. The findings underscore the transformative potential of multi-modal AI systems in facilitating immersive, user-centric storytelling experiences, while also identifying avenues for future development in real-time interaction, fine-grained customization, and adaptive content generation.
Introduction
Overview
GenNarrate is a unified platform that enables the automated creation of immersive stories by combining text, image, and audio generation through generative AI models. Unlike traditional tools that handle each modality separately, GenNarrate provides end-to-end multimedia storytelling using a modular backend and intuitive user interface.
Key Features and Functionality
1. Multimodal Integration
Text Generation: Uses LLaMA2, a large language model, to generate coherent narratives based on user-defined parameters (e.g., genre, tone, character types).
Image Generation: Employs DALL·E and Stable Diffusion to create illustrations that visually match story scenes.
Audio Generation: Combines Google Text-to-Speech (GTTS) for narration with AudioLDM for ambient sounds and background music, enhancing immersion.
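The layering of narration over ambient sound can be illustrated with a minimal sample-level mixer. This is a hypothetical helper for exposition only, not the system's actual audio pipeline, which operates on GTTS and AudioLDM outputs rather than raw sample lists:

```python
def mix_tracks(narration, ambient, ambient_gain=0.3):
    """Overlay an ambient bed under narration samples.

    Both inputs are sequences of float samples in [-1.0, 1.0]; the
    shorter track is zero-padded. The result is hard-clipped so the
    ambient layer never pushes the mix out of range.
    """
    n = max(len(narration), len(ambient))
    mixed = []
    for i in range(n):
        a = narration[i] if i < len(narration) else 0.0
        b = ambient[i] if i < len(ambient) else 0.0
        s = a + ambient_gain * b
        mixed.append(max(-1.0, min(1.0, s)))  # clip to valid range
    return mixed
```

Keeping the ambient gain well below 1.0 reflects the design goal stated above: background audio should enhance immersion without masking the narration.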
2. User Interface and Workflow
Built with React.js and Material UI, the web interface allows users to input story settings.
Outputs include a PDF storybook (with images and text) and a synchronized MP3 audio file.
The backend, powered by Flask, handles communication between modules and assembles the final content.
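A minimal sketch of the Flask entry point conveys the workflow: the frontend posts the user's story settings as JSON, the backend validates them and dispatches to the generation modules. The route name, parameter names, and the `generate_story` stand-in are illustrative assumptions, not the system's actual API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_story(params):
    # Hypothetical stand-in for the LLaMA2 text module.
    return f"A {params.get('tone', 'neutral')} {params.get('genre', 'fantasy')} tale."

@app.route("/generate", methods=["POST"])
def generate():
    params = request.get_json(force=True)
    # Basic validation of the user-configurable parameters.
    if "genre" not in params:
        return jsonify({"error": "genre is required"}), 400
    story = generate_story(params)
    return jsonify({"story": story})
```

In the full system this handler would additionally invoke the image and audio modules and return links to the assembled PDF and MP3 artifacts.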
System Design and Architecture
Modular Components
Input Module: Collects user preferences (genre, number of scenes, tone).
Text Module: Generates structured story content using LLaMA2.
Image Module: Extracts visual prompts from the text and renders illustrations.
Audio Module: Converts text to speech and adds environmental audio layers.
Output Module: Compiles the multimedia content into downloadable formats.
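The flow through these modules can be sketched as a single orchestration function in which each downstream module derives its input from shared scene metadata. The data classes and string templates here are illustrative assumptions standing in for the real model calls:

```python
from dataclasses import dataclass, field

@dataclass
class StoryRequest:
    genre: str
    tone: str
    num_scenes: int

@dataclass
class StoryBundle:
    text: list = field(default_factory=list)           # one entry per scene
    image_prompts: list = field(default_factory=list)  # fed to the image module
    audio_cues: list = field(default_factory=list)     # fed to the audio module

def run_pipeline(req):
    """Illustrative orchestration: each module reads shared scene
    metadata so downstream outputs stay aligned with the text."""
    bundle = StoryBundle()
    for i in range(1, req.num_scenes + 1):
        scene = f"Scene {i}: a {req.tone} moment in a {req.genre} story."
        bundle.text.append(scene)
        # Image module derives its prompt from the generated scene text.
        bundle.image_prompts.append(f"Illustration of: {scene}")
        # Audio module tags the scene with a mood cue for sound design.
        bundle.audio_cues.append({"scene": i, "mood": req.tone})
    return bundle
```

Because every modality is keyed to the same scene index and tone, the output module can compile the bundle without re-parsing the narrative.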
Cross-Modal Coherence
Maintains narrative alignment across text, images, and audio through shared metadata and prompt engineering.
Uses NLP to segment narratives into chapters for synchronized transitions in visuals and sound.
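The chapter segmentation step can be sketched with a simple sentence-based chunker. The real system may use a more sophisticated NLP segmenter; this minimal version assumes chapters of a fixed sentence count:

```python
import re

def segment_narrative(text, sentences_per_chapter=3):
    """Split a story into chapters of roughly equal sentence counts,
    so image and audio transitions can be synchronized per chapter."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    chapters = []
    for i in range(0, len(sentences), sentences_per_chapter):
        chapters.append(" ".join(sentences[i:i + sentences_per_chapter]))
    return chapters
```

Each returned chapter then becomes the unit for which one illustration prompt and one audio segment are generated, which is what keeps visual and sonic transitions aligned with the narrative.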
System Requirements
Functional Requirements
Text generation based on detailed story parameters.
Scene-specific image creation.
Narration with background audio.
Downloadable output formats with real-time previews.
Non-Functional Requirements
Low latency, modular scalability, and fault tolerance.
Consistent tone and mood across modalities.
Accessibility support (screen readers, keyboard navigation).
Literature Review Highlights
Builds on advances in text (LLaMA2, StoryGenAI), image (DALL·E), and audio synthesis (WaveNet, AudioLDM).
Integrates insights from multimodal research (CLIP, GANs, narrative control models).
Addresses the limitations of siloed content creation by offering an end-to-end, automated storytelling solution.
Applications
GenNarrate has broad utility across:
Education (interactive learning),
Entertainment (customized storybooks, podcasts),
Therapy (narrative-based interventions),
Digital media production (automated storyboarding and content creation).
Conclusion
The GenNarrate platform represents a significant advancement in the domain of AI-powered content creation by unifying three complex modalities (text generation, image synthesis, and audio narration) into a seamless, user-centric storytelling experience. Unlike traditional story generators or single-modal generative tools, GenNarrate leverages state-of-the-art models such as LLaMA2, DALL·E, GTTS, and AudioLDM to construct richly immersive, personalized narratives based on user-defined parameters.
The implementation achieves high cross-modal coherence by synchronizing inputs and outputs through a modular backend architecture built with Flask and a responsive frontend developed in React.js. By employing prompt engineering, automated text segmentation, scene extraction, and audio layering, the system ensures that each generated story is logically structured, visually engaging, and emotionally resonant.
Beyond technical execution, GenNarrate illustrates the potential of AI as a co-creative agent. Its applications span a broad spectrum, from educational storytelling tools and therapeutic media to entertainment and content marketing, demonstrating its versatility across domains. The platform’s interactive interface, automated pipeline, and downloadable output formats offer an end-to-end solution that democratizes multimedia storytelling for users with no technical background.
Although current capabilities are limited to static images, monolingual narration, and cloud-based deployment, future iterations of GenNarrate aim to incorporate multilingual support, video generation, neural voice cloning, and real-time content feedback. These enhancements will further extend the system’s usability, accessibility, and expressive range.
In conclusion, GenNarrate exemplifies the power of generative AI when modular design, cutting-edge models, and thoughtful user experience come together. It sets a new benchmark in automated storytelling systems and lays the groundwork for continued research in multi-modal narrative synthesis, human-AI co-creativity, and adaptive content generation.
References
[1] S. Fotedar et al., “Storytelling ai: A generative approach to story narration,” CEUR Workshop Proceedings, vol. 2794, 2021. [Online]. Available: https://ceur-ws.org/Vol-2794/paper4.pdf
[2] G. Kuznetsov et al., “Storytelling through deep learning,” CEUR Workshop Proceedings, 2019. [Online]. Available: https://ceur-ws.org/Vol-2794/paper4.pdf
[3] A. Ramesh et al., “Zero-shot text-to-image generation,” arXiv preprint arXiv:2102.12092, 2021. [Online]. Available: https://arxiv.org/abs/2102.12092
[4] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016. [Online]. Available: https://arxiv.org/abs/1609.03499
[5] H. Liu et al., “AudioLDM: Text-to-audio generation with latent diffusion models,” arXiv preprint arXiv:2301.12503, 2023. [Online]. Available: https://arxiv.org/abs/2301.12503
[6] R. E. Cardona-Rivera and D. L. Roberts, “Controlling narrative time in interactive storytelling,” in International Conference on Interactive Digital Storytelling, 2012. [Online]. Available: https://www.researchgate.net/publication/221456514
[7] A. Radford et al., “Learning transferable visual models from natural language supervision,” arXiv preprint arXiv:2103.00020, 2021. [Online]. Available: https://arxiv.org/abs/2103.00020
[8] I. Goodfellow et al., “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014. [Online]. Available: https://arxiv.org/abs/1406.2661
[9] Suraj et al., “A survey on the state of the art in audio generation models,” arXiv preprint arXiv:2005.00341, 2021. [Online]. Available: https://arxiv.org/abs/2005.00341
[10] L. P. Khan et al., “StoryGenAI: An automatic genre-keyword based story generation,” IEEE Xplore, 2023. [Online]. Available: https://ieeexplore.ieee.org/document/10183482