![[taxonomy-neural-tts.png]]
A taxonomy of neural TTS ([Tan2021](https://arxiv.org/abs/2106.15561))
![[data-flows-tts-systems.png]]
The data flow from text to waveform of different TTS systems ([Tan2021](https://arxiv.org/abs/2106.15561))
The use of end-to-end models for text-to-speech (TTS) synthesis is a relatively recent development that has gained significant traction thanks to advances in deep learning and neural network architectures.
Traditional TTS systems used a pipeline approach: the input text was first processed by a text analysis module to produce linguistic features, which were then fed into a separate synthesis module to generate the corresponding speech waveform. This approach was complex and required significant engineering effort to optimize each component individually.
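To make the contrast concrete, here is a minimal sketch of the two data flows. All function names and the toy stages are hypothetical placeholders, not real components:

```python
import numpy as np

# Hypothetical toy stages standing in for separately engineered components;
# a real pipeline would use grapheme-to-phoneme rules, a duration/acoustic
# model, and a signal-processing vocoder.
def analyze_text(text):           # text analysis -> linguistic features
    return np.array([ord(c) for c in text.lower()], dtype=float)

def predict_acoustics(features):  # acoustic model -> per-frame parameters
    return np.repeat(features, 5)

def vocode(acoustics):            # vocoder -> waveform samples
    return np.sin(0.01 * np.cumsum(acoustics))

def pipeline_tts(text):
    # Three hand-tuned stages; errors made early propagate downstream.
    return vocode(predict_acoustics(analyze_text(text)))

# An end-to-end model collapses this chain into one jointly trained network:
#   waveform = model(text)
waveform = pipeline_tts("hello world")
print(waveform.shape)  # (55,) -- 11 characters x 5 frames each
```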
In 2016, a major step toward end-to-end TTS synthesis was achieved by researchers at DeepMind, led by Aäron van den Oord, with the "[WaveNet](https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio)" model ([Oord 2016](https://arxiv.org/abs/1609.03499)), which used a deep convolutional neural network (CNN) to generate the raw speech waveform sample by sample, conditioned on linguistic features. This approach showed significant improvements over previous methods and has since become a foundation for end-to-end TTS synthesis.
![[wavenet-architecture-2016.png]]
WaveNet architecture ([Oord 2016](https://arxiv.org/abs/1609.03499))
![[dilated-causal-convolutions-wavenet-2016.png]]
WaveNet's stack of dilated causal convolution layers ([Oord 2016](https://arxiv.org/abs/1609.03499))
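The figure above can be read off almost directly as code. Below is a minimal PyTorch sketch of a stack of dilated causal convolutions with WaveNet's gated activation units; the class names are my own, and the real model additionally has skip connections, 1×1 convolutions, μ-law output quantization, and conditioning inputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that left-pads, so output[t] depends only on x[<= t]."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class WaveNetStack(nn.Module):
    """Layers with dilation 1, 2, 4, ..., 128: the receptive field doubles
    per layer, reaching 256 samples with 8 layers and kernel size 2."""
    def __init__(self, channels=32, layers=8):
        super().__init__()
        self.filters = nn.ModuleList(
            CausalConv1d(channels, 2, 2 ** i) for i in range(layers))
        self.gates = nn.ModuleList(
            CausalConv1d(channels, 2, 2 ** i) for i in range(layers))

    def forward(self, x):
        for f, g in zip(self.filters, self.gates):
            residual = x
            x = torch.tanh(f(x)) * torch.sigmoid(g(x))  # gated activation unit
            x = x + residual  # residual connection
        return x

x = torch.randn(1, 32, 1000)    # (batch, channels, time)
print(WaveNetStack()(x).shape)  # torch.Size([1, 32, 1000])
```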
Since then, researchers have continued to refine and improve end-to-end models for TTS synthesis, using a variety of architectures such as autoregressive models, generative adversarial networks (GANs), and transformer networks. Today, end-to-end models are considered state-of-the-art in TTS synthesis and are widely used in applications such as virtual assistants, audiobooks, and automated voice response systems.
Some well-known end-to-end models for speech synthesis are:
- WaveNet: WaveNet is a deep neural network developed by researchers at DeepMind in 2016. It is based on a stack of dilated causal convolutions that directly generates raw audio samples, without the need for an intermediate signal-processing vocoder.
- Tacotron: Tacotron is an end-to-end generative model for TTS developed by researchers at Google in 2017. It uses a sequence-to-sequence architecture with an attention mechanism to generate a spectrogram directly from text, which is then converted to a waveform (originally with the Griffin-Lim algorithm).
- Deep Voice: Deep Voice is a family of neural TTS systems developed by researchers at Baidu in 2017. It includes three main models: Deep Voice 1 and 2 replaced each stage of the traditional pipeline with a neural network, while Deep Voice 3 moved to a fully convolutional, attention-based architecture.
- Char2Wav: Char2Wav is an end-to-end speech synthesis model developed by researchers at MILA (Université de Montréal) in 2017. It combines a sequence-to-sequence "reader" with attention and a SampleRNN-based neural vocoder to generate speech waveforms directly from a sequence of text characters.
- MelGAN: MelGAN is a generative adversarial network (GAN) based vocoder developed by researchers at Lyrebird AI and MILA in 2019. It is a non-autoregressive, fully convolutional model that converts mel-spectrograms into high-quality waveforms far faster than autoregressive vocoders (a simplified generator sketch appears after this list).
- FastSpeech 2s: A fully end-to-end TTS model that generates waveforms directly from text, using [[generative adversarial network (GAN)]] modeling and a CNN/self-attention architecture.
- VITS: A fully end-to-end TTS model using [[auto-codificador variacional (variational autoencoder, VAE)|variational autoencoder (VAE)]]+Flow modeling and a self-attention/CNN hybrid architecture.
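To give a flavor of the GAN-based vocoders above, here is a heavily simplified PyTorch sketch in the spirit of MelGAN's generator: a stack of transposed convolutions that upsamples mel-spectrogram frames 256× into waveform samples. The class name and layer sizes are my own assumptions; the real generator adds residual stacks of dilated convolutions and weight normalization, and the adversarial discriminators and training losses are omitted entirely.

```python
import torch
import torch.nn as nn

class TinyMelGANGenerator(nn.Module):
    """Hypothetical, heavily simplified MelGAN-style generator:
    upsamples mel frames to audio with transposed convolutions."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=7, padding=3),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),  # 8x upsample
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),   # 8x upsample
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2),     # 4x upsample
            nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):       # mel: (batch, n_mels, frames)
        return self.net(mel)      # (batch, 1, frames * 256)

mel = torch.randn(1, 80, 50)      # 50 fake mel frames
wave = TinyMelGANGenerator()(mel)
print(wave.shape)                 # torch.Size([1, 1, 12800]) -- 256 samples per frame
```

Because the whole stack is convolutional, generation is parallel across time, which is why vocoders in this family run orders of magnitude faster than sample-by-sample autoregressive models like WaveNet.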