With the emergence of deep learning, a new approach to speech synthesis, neural network-based TTS (neural TTS), has been developed. It replaces the HMM-based acoustic models of statistical parametric synthesis with neural networks and uses deep neural networks as the backbone of the synthesis pipeline. WaveNet was one of the first modern neural TTS models to generate waveforms directly from linguistic features. Models such as Deep Voice 1/2 upgrade the three components of statistical parametric synthesis (text analysis, acoustic model, and vocoder) with corresponding neural network-based modules. End-to-end models such as Tacotron 1/2, Deep Voice 3, and FastSpeech 1/2 simplify the text analysis module by taking character or phoneme sequences directly as input, and simplify the acoustic features to mel-spectrograms. Fully end-to-end systems such as ClariNet, FastSpeech 2s, and EATS go further and generate the waveform directly from text. Compared to earlier concatenative synthesis and statistical parametric synthesis, neural TTS offers higher voice quality in terms of both intelligibility and naturalness, and requires less human preprocessing and feature engineering.

![[key-compon-neural-tts.png]]

Key components in neural TTS ([Tan2021](https://arxiv.org/abs/2106.15561))

## References

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
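To make the pipeline in the figure above concrete, below is a minimal, runnable PyTorch sketch of the acoustic model → vocoder stages (phoneme IDs → mel-spectrogram → waveform). The module names (`ToyAcousticModel`, `ToyVocoder`), the vocabulary size, mel dimension, and hop size are illustrative assumptions, not any published architecture; real systems such as Tacotron 2 use large attention-based encoder-decoders and a neural vocoder like WaveNet in place of these toy layers.

```python
import torch
import torch.nn as nn

# Hypothetical toy dimensions (assumptions, not from any specific system).
PHONEME_VOCAB_SIZE = 64   # toy phoneme vocabulary
N_MELS = 80               # mel-spectrogram channels (80 is a common choice)

class ToyAcousticModel(nn.Module):
    """Maps a phoneme-ID sequence to a mel-spectrogram (frames x mels)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(PHONEME_VOCAB_SIZE, 128)
        self.rnn = nn.GRU(128, 128, batch_first=True)
        self.proj = nn.Linear(128, N_MELS)

    def forward(self, phoneme_ids):              # (batch, time)
        x = self.embed(phoneme_ids)              # (batch, time, 128)
        x, _ = self.rnn(x)                       # (batch, time, 128)
        return self.proj(x)                      # (batch, time, n_mels)

class ToyVocoder(nn.Module):
    """Expands each mel frame into HOP waveform samples."""
    HOP = 256  # assumed samples per mel frame

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(N_MELS, self.HOP)

    def forward(self, mels):                     # (batch, time, n_mels)
        frames = self.net(mels)                  # (batch, time, HOP)
        return frames.reshape(mels.size(0), -1)  # (batch, time * HOP)

# Usage: run fake phoneme IDs through both stages and inspect shapes.
phonemes = torch.randint(0, PHONEME_VOCAB_SIZE, (1, 20))
mel = ToyAcousticModel()(phonemes)
wave = ToyVocoder()(mel)
print(mel.shape, wave.shape)  # torch.Size([1, 20, 80]) torch.Size([1, 5120])
```

The point of the sketch is the interface between the stages: the acoustic model and vocoder communicate only through the mel-spectrogram, which is why end-to-end models like Tacotron 2 could swap in a WaveNet vocoder without changing the text-to-mel network.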