Abstract
We present an integrated system that generates complete musical compositions from textual input. Our approach leverages three distinct pretrained models: a text-to-text generation model to produce song lyrics, a text-to-speech (TTS) model to vocalize the lyrics, and a text-to-audio music generator to synthesize complementary background music. The system then overlays the synthesized vocals on the background music to produce the final song. Experimental demonstrations indicate that this modular pipeline can generate coherent vocal renditions accompanied by music that supports the song’s mood and style. This work highlights a flexible framework that may be extended to a range of creative and interactive music applications.
Introduction
Overview:
Recent advances in natural language processing and audio synthesis enable automated creation of complete songs by integrating lyrics, vocal synthesis, and instrumental music generation. This work presents a modular pipeline that, given a textual prompt (song topic or style), automatically generates lyrics, synthesizes vocals, produces background music, and mixes these components into a finished musical piece.
Key Components (an illustrative code sketch for each step follows this list):
Lyric Generation: Uses a pretrained text-to-text model (e.g., LaMini-Flan-T5) to generate creative, coherent song lyrics from a text prompt.
Voice Synthesis: Converts generated lyrics into natural-sounding vocal audio using a state-of-the-art text-to-speech (TTS) model (Bark), preserving emotional and prosodic cues.
Background Music Generation: Produces an instrumental track with a text-conditioned music generation model (MusicGen) that complements the song’s mood and style.
Audio Mixing: Combines vocals and background music using audio processing tools, balancing volumes to ensure clarity and coherence of the final MP3 output.
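A minimal sketch of the lyric-generation step is shown below, using the Hugging Face Transformers pipeline API with the LaMini-Flan-T5 checkpoint named above. The prompt template, token limit, and sampling temperature are illustrative assumptions, not values taken from our implementation.

```python
from transformers import pipeline

# Load the text-to-text model used for lyric generation.
lyric_generator = pipeline(
    "text2text-generation",
    model="MBZUAI/LaMini-Flan-T5-248M",
)

def generate_lyrics(topic: str) -> str:
    """Turn a song topic or style into lyrics with one text-to-text call."""
    prompt = f"Write song lyrics about {topic}."  # assumed prompt template
    result = lyric_generator(
        prompt,
        max_new_tokens=256,  # cap the lyric length
        do_sample=True,      # sample for more varied, creative output
        temperature=0.9,
    )
    return result[0]["generated_text"]
```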
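For vocal synthesis, the sketch below uses Bark's generate_audio API. The speaker preset and the line-by-line chunking are assumptions: Bark produces short clips, so longer lyrics are synthesized per line and concatenated here.

```python
import numpy as np
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # download and cache the Bark weights once

def synthesize_vocals(lyrics: str, out_path: str = "vocals.wav") -> str:
    """Render each lyric line with Bark and concatenate the clips."""
    clips = [
        generate_audio(line, history_prompt="v2/en_speaker_6")  # assumed voice preset
        for line in lyrics.splitlines()
        if line.strip()
    ]
    audio = np.concatenate(clips)
    # Convert float32 samples in [-1, 1] to 16-bit PCM for easy downstream mixing.
    write_wav(out_path, SAMPLE_RATE, (audio * 32767).astype(np.int16))
    return out_path
```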
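The background-music step can be sketched with AudioCraft's MusicGen as below; the model size (musicgen-small) and the 30-second duration are assumptions chosen for illustration.

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

music_model = MusicGen.get_pretrained("facebook/musicgen-small")
music_model.set_generation_params(duration=30)  # seconds of audio to generate

def generate_music(style_prompt: str, out_stem: str = "music") -> str:
    """Generate an instrumental bed conditioned on a text description."""
    wav = music_model.generate([style_prompt])  # tensor of shape [1, C, T]
    # audio_write appends the .wav suffix and applies loudness normalization.
    audio_write(out_stem, wav[0].cpu(), music_model.sample_rate, strategy="loudness")
    return out_stem + ".wav"
```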
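Finally, the mixing step overlays the vocal track on the music bed with pydub. The -6 dB attenuation and the loop-to-length logic are assumed defaults rather than tuned values; MP3 export requires ffmpeg on the system path.

```python
from pydub import AudioSegment

def mix_song(vocal_path: str, music_path: str, out_path: str = "song.mp3") -> str:
    """Duck the music under the vocals and export the mix as MP3."""
    vocals = AudioSegment.from_wav(vocal_path)
    music = AudioSegment.from_wav(music_path) - 6  # assumed -6 dB music bed
    if len(music) < len(vocals):                   # lengths are in milliseconds
        music = music * (len(vocals) // len(music) + 1)  # loop to cover the vocals
    mixed = music[: len(vocals)].overlay(vocals)
    mixed.export(out_path, format="mp3")
    return out_path
```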
Implementation:
Developed in Python using Hugging Face Transformers, Bark, MusicGen, pydub, SciPy, and Gradio for an interactive web UI.
Modular design allows independent updates of each component.
User interface lets users enter a prompt, view the generated lyrics, and download the finished song; a minimal sketch of this interface follows the list.
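The Gradio front end can be sketched as below, wiring together the hypothetical helper functions from the component sketches above (generate_lyrics, synthesize_vocals, generate_music, mix_song).

```python
import gradio as gr

def make_song(prompt: str):
    """Run the full pipeline for one prompt; return lyrics and the mixed song."""
    lyrics = generate_lyrics(prompt)
    vocal_path = synthesize_vocals(lyrics)
    music_path = generate_music(prompt)
    song_path = mix_song(vocal_path, music_path)
    return lyrics, song_path

demo = gr.Interface(
    fn=make_song,
    inputs=gr.Textbox(label="Song topic or style"),
    outputs=[
        gr.Textbox(label="Generated lyrics"),
        gr.Audio(label="Final song", type="filepath"),
    ],
    title="Text-to-Song Pipeline",
)

if __name__ == "__main__":
    demo.launch()
```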
Experimental Results:
Generated songs are coherent, with contextually relevant lyrics, natural-sounding vocals, and well-matched instrumental accompaniment.
Limitations include occasional mismatches between the mood of the music and the emotional tone of the lyrics, indicating room for improved conditioning and joint optimization of the components.
Future Directions:
Dynamic conditioning to better align music with lyrics.
Enhanced vocal expressiveness via prosody and emotion controls.
Incorporating user feedback for iterative song refinement.
Optimizing for near real-time generation to support interactive applications.
Conclusion
We have introduced a modular pipeline that unifies text-to-text generation, TTS, and music generation to create a complete AI-generated song. Our system demonstrates promising qualitative results by producing coherent lyrical content, natural vocal renditions, and complementary instrumental backgrounds.
Future research will focus on refining the conditioning mechanisms and integrating user feedback to further enhance the musical quality.
This work contributes a flexible framework for future explorations in AI-driven music generation.
References
[1] Borsos, Z., et al. (2022). AudioLM: A Language Modeling Approach to Audio Generation. Retrieved from arXiv:2209.03143.
[2] Dhariwal, P., et al. (2020). Jukebox: A Generative Model for Music. OpenAI. Retrieved from https://openai.com/blog/jukebox/.
[3] Forsgren, S., & Martiros, H. (2022). Riffusion: Text-to-Audio Diffusion for Music Generation. Retrieved from https://github.com/riffusion/riffusion-app.
[4] Hugging Face. (2023). Transformers Documentation. Retrieved from https://huggingface.co/docs/transformers/index.
[5] MBZUAI. (2023). LaMini-Flan-T5-248M. Retrieved from https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M.
[6] Suno AI. (2023). Bark: High-Fidelity Text-to-Speech Synthesis. Retrieved from https://github.com/suno-ai/bark.
[7] Facebook AI. (2023). MusicGen: Generative Music from Text. Retrieved from https://github.com/facebookresearch/audiocraft.
[8] Gradio. (2023). Gradio: Build Machine Learning Demos and Web Apps. Retrieved from https://gradio.app/.
[9] The SciPy community. (2023). SciPy Documentation. Retrieved from https://docs.scipy.org/.
[10] Pydub. (2023). Pydub Documentation. Retrieved from https://github.com/jiaaro/pydub.