According to recent studies, feed-forward deep neural networks (DNNs) outperform text-to-speech (TTS) systems based on decision-tree clustered context-dependent hidden Markov models (HMMs). However, the feed-forward nature of DNN-based models makes it difficult to capture long-span contextual effects within a spoken utterance. A further common strategy in HMM-based TTS for producing a continuous speech trajectory is to constrain speech parameter generation with dynamic features. In this study, parametric text-to-speech synthesis is performed with long short-term memory (LSTM) cells, which can capture the co-occurrence or correlation between any two instants in a spoken utterance. Based on our experiments, a hybrid of DNN and BLSTM-RNN layers is the best-performing system: the lower hidden layers use a simple one-way (feed-forward) structure, while the upper hidden layers use bidirectional LSTM-RNN layers. On both objective and subjective metrics, it surpasses the conventional decision-tree clustered HMM system and the DNN-based TTS system. Because the BLSTM-RNN TTS already produces very smooth speech trajectories, dynamic constraints become superfluous.
Introduction
The study investigates the use of Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks (DBLSTM-RNNs) for improving Text-to-Speech (TTS) synthesis systems, especially in comparison to traditional HMM- and DNN-based approaches.
Key Insights and Contributions:
Traditional TTS Systems:
HMM-based TTS systems (often GMM-HMM) are compact and efficient for mobile use.
They suffer from over-smoothed trajectories, reducing speech naturalness.
Improvements like context modeling and state clustering help but don't fully solve this.
DNN-Based Systems:
DNNs outperform HMMs by better modeling the relationship between input text and acoustic features.
However, DNNs operate on static units (frames/states), limiting their ability to capture long-range temporal context.
Enhancements using techniques like Deep Belief Networks and Restricted Boltzmann Machines attempt to bridge this gap.
DBLSTM-RNN Approach:
Combines deep feed-forward layers with bidirectional LSTM-RNN layers to capture both past and future context in speech (a minimal sketch of this hybrid stack follows the list).
Allows for smoother and more natural speech trajectories without needing explicit dynamic constraints.
Uses rich linguistic and phonetic features for input (e.g., phone IDs, syllable stress, word position).
Produces static features (e.g., F0, gain, LSP) directly used by the vocoder to synthesize speech.
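To make the hybrid structure concrete, the following is a minimal sketch of such a model: lower feed-forward layers followed by upper bidirectional LSTM layers that map frame-level linguistic features to static acoustic features. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; PyTorch, the layer sizes, and the feature dimensions (355 inputs, 43 outputs) are assumptions, not values from the paper.

```python
# Minimal sketch (assumed dimensions) of a hybrid DNN + BLSTM acoustic model:
# lower feed-forward layers, upper bidirectional LSTM layers, linear output.
import torch
import torch.nn as nn

class HybridDNNBLSTM(nn.Module):
    def __init__(self, in_dim=355, ff_dim=512, lstm_dim=256, out_dim=43):
        super().__init__()
        # Lower layers: simple one-way (feed-forward) structure.
        self.ff = nn.Sequential(
            nn.Linear(in_dim, ff_dim), nn.Tanh(),
            nn.Linear(ff_dim, ff_dim), nn.Tanh(),
        )
        # Upper layers: bidirectional LSTM capturing past and future context.
        self.blstm = nn.LSTM(ff_dim, lstm_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Linear output layer producing static acoustic features
        # (e.g., F0, gain, LSP) consumed directly by the vocoder.
        self.out = nn.Linear(2 * lstm_dim, out_dim)

    def forward(self, x):
        # x: (batch, frames, in_dim) frame-level linguistic/phonetic features
        h = self.ff(x)
        h, _ = self.blstm(h)
        return self.out(h)

# Usage: one utterance of 300 frames with 355 input features per frame.
model = HybridDNNBLSTM()
acoustic = model(torch.randn(1, 300, 355))   # -> shape (1, 300, 43)
```

The output layer emits the static features per frame, which a vocoder can then turn into a waveform; no dynamic-feature constraints are applied at generation time.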
Training & Experimentation:
Data: 5 hours of American English female speech, sampled at 16 kHz.
Subjective evaluation: AB preference test with 60 participants.
Results:
Hybrid DBLSTM-RNN systems outperformed both HMM and DNN in:
Objective metrics, e.g., lower RMSE and log spectral distance (LSD); these measures are sketched below.
Subjective listener preference (~55%–59% preferred Hybrid over HMM/DNN).
Although computationally heavier and requiring more training resources, DBLSTM-RNNs yielded more natural and expressive synthesized speech.
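For reference, the objective comparison relies on standard distortion measures. The snippet below sketches how RMSE (e.g., on F0 or gain trajectories) and log spectral distance between synthesized and natural speech can be computed with NumPy; the exact feature definitions and averaging used in the paper may differ, and the array shapes here are illustrative assumptions.

```python
import numpy as np

def rmse(pred, ref):
    """Root-mean-square error between two time-aligned parameter trajectories."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    return np.sqrt(np.mean((pred - ref) ** 2))

def log_spectral_distance(spec_pred, spec_ref, eps=1e-10):
    """Mean log spectral distance (dB) between two aligned magnitude
    spectrogram-like arrays of shape (frames, bins)."""
    p = 20.0 * np.log10(np.maximum(np.abs(spec_pred), eps))
    r = 20.0 * np.log10(np.maximum(np.abs(spec_ref), eps))
    # Per-frame RMS over frequency bins, then average over frames.
    return np.mean(np.sqrt(np.mean((p - r) ** 2, axis=1)))

# Usage with toy data: 200 frames, 1 F0 value and 257 spectral bins per frame.
f0_rmse = rmse(np.random.rand(200), np.random.rand(200))
lsd = log_spectral_distance(np.random.rand(200, 257), np.random.rand(200, 257))
```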
Conclusion
This work applies BLSTM-RNNs to training a data-driven model for TTS. One or two hidden layers of a DNN can be replaced with bidirectional LSTM-RNN layers while keeping roughly the same number of model parameters. The experimental findings show that the hybrid BLSTM-RNN and DNN system outperforms both the HMM and DNN baselines in capturing complex details within a sentence. Our long-term goal is to study DBLSTM-RNNs with a more comprehensive structure and a larger corpus.