The development of text-to-speech (TTS) systems has advanced significantly with the introduction of deep learning-based models. This paper investigates the impact of various deep learning architectures, such as WaveNet and Tacotron 2, on the naturalness of synthesized speech. By leveraging convolutional neural networks (CNNs) and recurrent neural networks (RNNs), we explore techniques for improving prosody, intonation, and speech quality. Our experiments show that the integration of attention mechanisms and vocoder models leads to more accurate and human-like speech output, particularly in complex sentence structures. Additionally, we examine the potential of TTS systems in multilingual and emotional speech synthesis, showing promising results in generating speech with diverse accents and emotions.
I. Introduction
Text-to-Speech (TTS) systems convert written text into spoken words. While traditional methods used concatenation of pre-recorded speech units, modern systems rely on deep learning models that generate high-quality, natural-sounding speech.
This paper focuses on two prominent deep learning TTS models:
WaveNet (by DeepMind)
Tacotron 2 (by Google)
II. Background
Earlier TTS Approaches:
Statistical parametric synthesis (e.g., HMMs) produced robotic and unnatural speech.
WaveNet (2016):
An autoregressive neural network that generates raw audio waveforms one sample at a time.
Produced highly realistic speech, but sample-by-sample generation made it computationally heavy.
Tacotron 2 (2017):
A two-stage model that first converts text to mel spectrograms, then uses a WaveNet vocoder to convert the spectrograms into audio.
Achieved high-quality, efficient synthesis.
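As a rough illustration of this two-stage design, the sketch below wires the stages together; the objects and their `predict_mel` and `generate` method names are hypothetical placeholders, not interfaces described in this paper.

```python
# Conceptual two-stage synthesis flow: Tacotron 2 predicts a mel spectrogram
# from text, and a WaveNet vocoder renders the waveform from that spectrogram.
# Both model objects and their method names are hypothetical placeholders.

def synthesize(text, tacotron2, wavenet_vocoder):
    """Return a raw audio waveform for the given text."""
    mel_spectrogram = tacotron2.predict_mel(text)          # stage 1: text -> mel frames
    waveform = wavenet_vocoder.generate(mel_spectrogram)   # stage 2: mel frames -> samples
    return waveform
```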
III. Research Objectives
The paper aims to:
Study how deep learning improves TTS quality.
Evaluate Tacotron 2 and WaveNet in terms of audio quality, prosody, and real-time synthesis.
Explore applications in multilingual and emotion-infused speech synthesis.
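One concrete way to measure the real-time synthesis objective is the real-time factor, i.e. synthesis time divided by the duration of the generated audio. Below is a minimal sketch, assuming a generic `synthesize` callable and LJSpeech's 22.05 kHz sample rate; neither is specified at this point in the paper.

```python
import time

def real_time_factor(synthesize, text, sample_rate=22050):
    """Synthesis time divided by audio duration; RTF < 1 means faster than real time."""
    start = time.perf_counter()
    waveform = synthesize(text)              # any callable returning a 1-D sample array
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds
```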
IV. Related Work
Early systems: Festival, MBROLA.
Deep learning brought a paradigm shift.
WaveNet introduced realistic waveform generation.
Tacotron 2 combined sequence-to-sequence learning with WaveNet for improved quality.
Multilingual TTS: Recent models can synthesize speech in multiple languages using shared representations.
V. Methodology
Models: Tacotron 2 and WaveNet.
Datasets: LJSpeech (single speaker) and VCTK (multi-speaker, multiple English accents).
Tools: TensorFlow; models were trained on NVIDIA V100 GPUs (a preprocessing sketch follows below).
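To make the data preparation concrete, the sketch below loads LJSpeech-style metadata and computes log-mel spectrogram targets. The file layout, mel parameters (80 bands, 1024-point FFT, hop length 256), and the use of librosa are assumptions rather than details given in the paper.

```python
import csv
import librosa
import numpy as np

def load_examples(metadata_path, wav_dir, n_mels=80, hop_length=256):
    """Yield (normalized text, log-mel spectrogram) pairs from an LJSpeech-style corpus."""
    examples = []
    with open(metadata_path, encoding="utf-8") as f:
        # LJSpeech's metadata.csv is pipe-delimited: file id | raw text | normalized text.
        for file_id, _, normalized_text in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            wav, sr = librosa.load(f"{wav_dir}/{file_id}.wav", sr=22050)
            mel = librosa.feature.melspectrogram(
                y=wav, sr=sr, n_fft=1024, hop_length=hop_length, n_mels=n_mels)
            log_mel = np.log(np.clip(mel, 1e-5, None))     # log compression, Tacotron-style target
            examples.append((normalized_text, log_mel.T))  # shape: (frames, n_mels)
    return examples
```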
Tacotron 2 Architecture
Encoder: Converts the input character or phoneme sequence into a hidden linguistic representation.
Decoder: An attention-based decoder that predicts mel spectrogram frames from the encoded representation.
WaveNet Vocoder: Converts the predicted mel spectrograms into the final audio waveform (a sketch of the encoder follows below).
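A rough Keras sketch of the encoder stage is shown below. The layer sizes follow the published Tacotron 2 description rather than anything reported in this paper, and the attention-based decoder and WaveNet vocoder are only indicated in comments.

```python
import tensorflow as tf

def build_tacotron2_encoder(vocab_size, embed_dim=512, lstm_units=256):
    """Character embedding, three conv layers, and one bidirectional LSTM, Tacotron 2 style."""
    chars = tf.keras.Input(shape=(None,), dtype=tf.int32, name="char_ids")
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(chars)
    for _ in range(3):  # convolutional layers modeling local character context
        x = tf.keras.layers.Conv1D(embed_dim, kernel_size=5, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation("relu")(x)
    # The bidirectional LSTM yields the encoded representation the decoder attends to.
    encoded = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_units, return_sequences=True))(x)
    # Decoder (not shown): autoregressive LSTMs with location-sensitive attention
    # predict mel spectrogram frames; a WaveNet vocoder then renders the audio.
    return tf.keras.Model(chars, encoded, name="tacotron2_encoder")
```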
WaveNet Architecture
Uses stacks of dilated causal convolutions to generate the waveform sample by sample, conditioned on previously generated samples (and on spectrogram features when used as a vocoder); see the sketch below.
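The sketch below shows one way such a dilated-convolution stack can be written in Keras, with gated activations and residual/skip connections as in the original WaveNet; the channel counts and number of layers are illustrative assumptions, not values from the paper.

```python
import tensorflow as tf

def build_wavenet_stack(n_layers=10, residual_channels=64, skip_channels=128):
    """Stack of dilated causal convolutions with gated activations and residual/skip paths."""
    samples = tf.keras.Input(shape=(None, 1), name="waveform")  # one audio sample per step
    x = tf.keras.layers.Conv1D(residual_channels, 1)(samples)
    skips = []
    for i in range(n_layers):
        dilation = 2 ** i  # 1, 2, 4, ... doubles the receptive field at each layer
        tanh_out = tf.keras.layers.Conv1D(residual_channels, 2, padding="causal",
                                          dilation_rate=dilation, activation="tanh")(x)
        sigm_out = tf.keras.layers.Conv1D(residual_channels, 2, padding="causal",
                                          dilation_rate=dilation, activation="sigmoid")(x)
        gated = tf.keras.layers.Multiply()([tanh_out, sigm_out])  # gated activation unit
        skips.append(tf.keras.layers.Conv1D(skip_channels, 1)(gated))
        x = tf.keras.layers.Add()([x, tf.keras.layers.Conv1D(residual_channels, 1)(gated)])
    out = tf.keras.layers.Activation("relu")(tf.keras.layers.Add()(skips))
    out = tf.keras.layers.Conv1D(256, 1, activation="softmax")(out)  # 8-bit mu-law classes
    return tf.keras.Model(samples, out, name="wavenet_stack")
```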
VI. System Workflow
Text-to-Speech Processing Flow:
Input text
Sentence segmentation
Tokenization
POS tagging
Entity and relationship detection
Chunking and final tagging
Spectrogram generation (Tacotron 2)
Audio synthesis (WaveNet)
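The front-end steps of this flow (sentence segmentation, tokenization, POS tagging, entity chunking) can be sketched with NLTK as shown below; the choice of NLTK is an assumption, since the paper does not name the toolkit used for these stages.

```python
import nltk

# Requires the NLTK data packages: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, and words (install via nltk.download).

def preprocess(text):
    """Run the linguistic front end and return one parsed tree per sentence."""
    processed = []
    for sentence in nltk.sent_tokenize(text):      # sentence segmentation
        tokens = nltk.word_tokenize(sentence)      # tokenization
        tagged = nltk.pos_tag(tokens)              # POS tagging
        tree = nltk.ne_chunk(tagged)               # named-entity chunking
        processed.append(tree)
    return processed  # handed on to Tacotron 2 for spectrogram generation
```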
VII. Results
A functional web interface was created.
Users can input text, select voice gender, and receive the synthesized audio output.
Output speech quality was natural-sounding and customizable.
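A hedged sketch of how such an interface could be exposed over HTTP is given below; Flask and the `synthesize_to_wav` helper are assumptions, as the paper does not describe its web stack.

```python
from flask import Flask, request, send_file

app = Flask(__name__)

def synthesize_to_wav(text, gender):
    """Placeholder for the Tacotron 2 + WaveNet pipeline (assumed helper, not from the paper)."""
    raise NotImplementedError("plug in the trained models here")

@app.route("/synthesize", methods=["POST"])
def synthesize_endpoint():
    text = request.form["text"]                    # text entered by the user
    gender = request.form.get("gender", "female")  # voice-gender selection
    wav_path = synthesize_to_wav(text, gender)     # path to the rendered .wav file
    return send_file(wav_path, mimetype="audio/wav")

if __name__ == "__main__":
    app.run(debug=True)
```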
VIII. Conclusion
This research shows that deep learning-based TTS models, especially WaveNet and Tacotron 2, are capable of producing highly natural-sounding speech. Furthermore, advancements in multilingual and emotional speech synthesis highlight the potential for TTS systems to be applied in a wide range of applications, from virtual assistants to audiobook narration.
Future work will focus on improving real-time synthesis capabilities and exploring the use of emotion modeling in TTS systems to further enhance the expressiveness of generated speech.