Abstract
In recent years, deep learning has revolutionized natural language processing and speech synthesis, enabling machines to narrate text with human-like expression and clarity. This paper presents a novel system titled Distinct Voices for Enhanced Storytelling using Deep Learning, which generates character-specific voice narration for textual stories. The approach combines quotation attribution, character identification, and zero-shot text-to-speech synthesis to automatically assign unique and expressive voices to individual characters in a story. Tools such as BookNLP are used to parse and annotate quotations, while state-of-the-art models like XTTS enable multilingual and emotion-rich voice generation. This enhances listener engagement, particularly in audiobooks and educational contexts, by transforming plain text into immersive, character-driven audio. The system is scalable, requires no manual voice labeling, and demonstrates significant potential in the fields of digital storytelling, accessibility, and human-computer interaction.
1. Introduction
Storytelling has evolved from traditional print to digital audio formats such as audiobooks, podcasts, and virtual assistants. However, most narration systems still use a single monotone voice, which limits emotional depth and character diversity.
The proposed solution uses deep learning and natural language processing (NLP) to:
Attribute dialogue to characters
Assign distinct, expressive voices to each character
This system enhances digital storytelling, making it more engaging, accessible, and personalized.
2. Related Work
A. Quotation Attribution & Character Identification
BookNLP is a key tool for parsing literary texts and attributing dialogue.
Contextual embeddings and syntactic cues (e.g., Epure et al., Vishnubhotla et al.) improve attribution accuracy in complex narratives.
B. Voice Cloning & Zero-Shot TTS
Tools like XTTS and YourTTS support zero-shot voice synthesis: a short reference clip at inference time is enough to clone a voice, with no per-speaker training or fine-tuning.
These systems support expressive, multilingual speech generation.
C. Multi-Speaker & Expressive TTS
Models like Tacotron, FastSpeech 2, VALL-E, and MParrotTTS enable high-quality, multi-character, emotion-aware voice synthesis.
3. Methodology
A. System Overview
The pipeline includes the following stages (a high-level sketch follows the list):
Quotation Attribution
Character Identification
Voice Assignment
TTS Generation
Audio Assembly
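As a rough illustration of how these stages connect, the following Python sketch wires them together. Every helper function here is a hypothetical placeholder for the components described in the subsections below, not the paper's actual code.

```python
# Hypothetical end-to-end driver for the five-stage pipeline.
# Each helper stands in for one stage described in subsections B through E.
def narrate_story(text_path: str, output_path: str) -> None:
    quotes = attribute_quotations(text_path)        # who says what (BookNLP)
    characters = identify_characters(quotes)        # cluster mentions into speakers
    voices = assign_voices(characters)              # map each speaker to a voice
    segments = synthesize_segments(quotes, voices)  # zero-shot TTS per segment
    assemble_audio(segments, output_path)           # merge into one narrated track
```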
B. Quotation Detection
BookNLP parses the text, detects quotations, and attributes each one to a speaker.
Attribution is refined with contextual embeddings and syntactic cues for greater accuracy.
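A minimal sketch of this step, assuming BookNLP is installed (pip install booknlp) and its English models are available; file names and paths are illustrative.

```python
# Run BookNLP's pipeline, which writes <book_id>.quotes with
# speaker-attributed quotations (plus .entities, .tokens, etc.).
from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
}
booknlp = BookNLP("en", model_params)

# Illustrative paths; output files land in the given directory under the book_id prefix.
booknlp.process("stories/alice_in_wonderland.txt", "output/alice/", "alice")
```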
C. Voice Assignment
Characters are grouped and assigned synthetic voices based on gender, age, or emotion using XTTS speaker embeddings.
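In XTTS, speaker identity is conditioned on a short reference clip, so assignment can be reduced to choosing one reference clip per character. The sketch below is illustrative only: the voice-bank paths and the character metadata are assumptions, not the paper's actual assignment rules.

```python
# Illustrative voice bank keyed by coarse character traits; paths are hypothetical.
VOICE_BANK = {
    ("female", "child"): "voices/female_child.wav",
    ("female", "adult"): "voices/female_adult.wav",
    ("male", "adult"):   "voices/male_adult.wav",
    ("male", "elderly"): "voices/male_elderly.wav",
}

def assign_voices(characters: dict) -> dict:
    """Map each character name to a reference clip based on simple metadata."""
    assignments = {}
    for name, meta in characters.items():
        key = (meta.get("gender", "male"), meta.get("age", "adult"))
        assignments[name] = VOICE_BANK.get(key, "voices/narrator.wav")
    return assignments

voices = assign_voices({
    "Alice": {"gender": "female", "age": "child"},
    "Queen of Hearts": {"gender": "female", "age": "adult"},
})
```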
D. Zero-Shot TTS
XTTS generates realistic, emotional, and multilingual voices without requiring training data for each character.
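A minimal synthesis sketch using Coqui's XTTS v2 checkpoint through the TTS Python package; the exact integration is an assumption, and the model name follows Coqui's public catalog.

```python
from TTS.api import TTS

# Load the multilingual XTTS v2 checkpoint once and reuse it for all characters.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# A short reference clip conditions the speaker identity; no per-character
# training or fine-tuning is required.
tts.tts_to_file(
    text='"Curiouser and curiouser!" cried Alice.',
    speaker_wav="voices/female_child.wav",
    language="en",
    file_path="out/alice_0001.wav",
)
```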
E. Audio Assembly
Merges narration and dialogue segments into a single audio track
Preserves flow, pacing, and clarity (see the sketch below)
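A stitching sketch using pydub (the library choice is an assumption; the paper does not name one), inserting a short pause between segments to preserve pacing.

```python
from pydub import AudioSegment

def assemble(segment_paths, pause_ms=250):
    """Concatenate per-segment WAV files in story order with a short pause."""
    pause = AudioSegment.silent(duration=pause_ms)
    track = AudioSegment.empty()
    for path in segment_paths:
        track += AudioSegment.from_wav(path) + pause
    return track

# Illustrative file names produced by the TTS stage.
audiobook = assemble(["out/narrator_0001.wav", "out/alice_0001.wav"])
audiobook.export("out/story.wav", format="wav")
```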
F. Deployment Interface
Users can upload stories, preview characters, and download the final audio (an interface sketch follows below).
Suitable for education, entertainment, and assistive technology.
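The paper does not specify a UI framework; the following is a hypothetical Gradio front end showing the upload-and-download flow, with run_pipeline standing in for the full attribution-and-synthesis pipeline.

```python
import gradio as gr

def narrate(story_path: str) -> str:
    # run_pipeline is a hypothetical entry point that returns the path of the
    # assembled audio file for the uploaded story.
    return run_pipeline(story_path)

demo = gr.Interface(
    fn=narrate,
    inputs=gr.File(label="Story (.txt)", type="filepath"),
    outputs=gr.Audio(label="Narrated audio", type="filepath"),
    title="Distinct Voices for Enhanced Storytelling",
)

if __name__ == "__main__":
    demo.launch()
```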
4. Experimental Results
A. Setup
Data: Text from classic novels like Alice in Wonderland and Sherlock Holmes
Tools: BookNLP, XTTS v2, Coqui TTS
Hardware: RTX 3060 GPU, Python-based pipeline
B. Metrics & Results
Metric                          Result
Speaker Attribution Accuracy    89.3%
Mean Opinion Score (MOS)        4.35 / 5.0
Voice Differentiation Score     4.22 / 5.0
Avg. Processing Time            ~35 sec per 1,000 words
Voice quality was rated highly expressive and natural.
Distinct voices improved listener comprehension and engagement, especially in complex dialogues.
C. User Feedback
90% of listeners preferred the multi-voice narration over traditional single-voice narration.
Visually impaired users praised clarity and speaker differentiation.
Children showed higher engagement in educational content.
5. Conclusion
This paper presents a deep learning-based storytelling system that brings narratives to life by assigning distinct, expressive voices to each character. By integrating advanced natural language processing for speaker attribution with state-of-the-art text-to-speech synthesis using XTTS, the system transforms static literary text into engaging, multi-speaker audio.
The experimental results demonstrate high accuracy in character attribution and excellent voice quality, with strong differentiation across characters. Subjective feedback confirmed that listeners found the output more immersive and enjoyable compared to traditional single-voice narration. This approach holds significant promise for a variety of applications including audiobook generation, assistive reading tools, educational storytelling, and interactive fiction.
The system not only enhances the storytelling experience but also contributes to accessibility, particularly for visually impaired users. Its scalable, modular design enables it to adapt to different genres, languages, and user preferences, making it a versatile solution in the growing field of AI-generated content.
References
[1] Bamman, D., & Underwood, T. (2020). BookNLP: A natural language processing pipeline for novels. GitHub. https://github.com/booknlp/booknlp
[2] Epure, E., Hennequin, R., & Cerisara, C. (2024). Improving quotation attribution with fictional character embeddings. Findings of the Association for Computational Linguistics: EMNLP 2024. https://arxiv.org/abs/2406.11368
[3] Vishnubhotla, K., et al. (2023). Improving automatic quotation attribution in literary novels. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023). https://aclanthology.org/2023.acl-short.64.pdf
[4] Casanova, E., et al. (2024). XTTS: A massively multilingual zero-shot text-to-speech model. Interspeech 2024. https://arxiv.org/abs/2406.04904
[5] Ruggiero, G., Zovato, E., Di Caro, L., & Pollet, V. (2021). Voice cloning: A multi-speaker text-to-speech synthesis approach based on transfer learning. arXiv preprint arXiv:2102.05630. https://arxiv.org/abs/2102.05630
[6] Casanova, E., et al. (2024). XTTS: Taking TTS to the next level. Coqui Blog. https://coqui.ai/blog/tts/xtts_taking_tts_to_the_next_level
[7] Casanova, E., et al. (2021). YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. arXiv preprint arXiv:2112.02418. https://arxiv.org/abs/2112.02418
[8] Neekhara, P., et al. (2021). Expressive neural voice cloning. Proceedings of Machine Learning Research, 157, 1–12. https://proceedings.mlr.press/v157/neekhara21a/neekhara21a.pdf
[9] Reddy, V. K. (2023). Implementation of novel voice cloning method based on deep learning techniques. In Studies in Systems, Decision and Control (Vol. 571, pp. 239–252). Springer. https://link.springer.com/chapter/10.1007/978-3-031-75771-6_20
[10] Coqui.ai. (n.d.). XTTS. TTS 0.22.0 documentation. https://docs.coqui.ai/en/stable/models/xtts.html
[11] Wang, Y., et al. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135. https://arxiv.org/abs/1703.10135
[12] Ren, Y., et al. (2020). FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558. https://arxiv.org/abs/2006.04558
[13] Zhang, Y., et al. (2023). Boosting multi-speaker expressive speech synthesis with semi-supervised learning. arXiv preprint arXiv:2310.17101. https://arxiv.org/abs/2310.17101
[14] Wang, C., et al. (2023). Neural codec language models are zero-shot text to speech synthesizers (VALL-E). arXiv preprint arXiv:2301.02111. https://arxiv.org/abs/2301.02111
[15] Liu, J., et al. (2023). MParrotTTS: Multilingual multi-speaker text to speech synthesis in low-resource settings. arXiv preprint arXiv:2305.11926. https://arxiv.org/abs/2305.11926