![[taxonomy-neural-tts.png]]
A taxonomy of neural TTS ([Tan2021](https://arxiv.org/abs/2106.15561))
![[data-flows-tts-systems.png]]
The data flow from text to waveform of different TTS systems ([Tan2021](https://arxiv.org/abs/2106.15561))
The use of end-to-end models for text-to-speech (TTS) synthesis is a relatively recent development that has gained significant traction thanks to advances in deep learning and neural network architectures.
Traditional TTS systems used a pipeline approach: the input text was first processed by a text analysis module to produce linguistic features, which were then fed into a separate synthesis module to generate the corresponding speech waveform. This approach was complex and required significant engineering effort to optimize each component individually.
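To make the contrast concrete, here is a minimal sketch of the two data flows. All function names and the toy stages are hypothetical placeholders, not real components:

```python
import numpy as np

# Hypothetical toy stages standing in for separately engineered components;
# a real pipeline would use grapheme-to-phoneme rules, a duration/acoustic
# model, and a signal-processing vocoder.
def analyze_text(text):           # text analysis -> linguistic features
    return np.array([ord(c) for c in text.lower()], dtype=float)

def predict_acoustics(features):  # acoustic model -> per-frame parameters
    return np.repeat(features, 5)

def vocode(acoustics):            # vocoder -> waveform samples
    return np.sin(0.01 * np.cumsum(acoustics))

def pipeline_tts(text):
    # Three hand-tuned stages; errors made early propagate downstream.
    return vocode(predict_acoustics(analyze_text(text)))

# An end-to-end model collapses this chain into one jointly trained network:
#   waveform = model(text)
waveform = pipeline_tts("hello world")
print(waveform.shape)  # (55,) -- 11 characters x 5 frames each
```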
In 2016, a major step toward end-to-end TTS synthesis was achieved by researchers at DeepMind, led by Aäron van den Oord, with the "[WaveNet](https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio)" model ([Oord 2016](https://arxiv.org/abs/1609.03499)), which used a deep convolutional neural network (CNN) to generate the raw speech waveform sample by sample, conditioned on linguistic features. This approach showed significant improvements over previous methods and has since become a foundation for end-to-end TTS synthesis.
![[wavenet-architecture-2016.png]]
WaveNet architecture ([Oord 2016](https://arxiv.org/abs/1609.03499))
![[dilated-causal-convolutions-wavenet-2016.png]]
WaveNet's stack of dilated causal convolution layers ([Oord 2016](https://arxiv.org/abs/1609.03499))
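The figure above can be read off almost directly as code. Below is a minimal PyTorch sketch of a stack of dilated causal convolutions with WaveNet's gated activation units; the class names are my own, and the real model additionally has skip connections, 1×1 convolutions, μ-law output quantization, and conditioning inputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that left-pads, so output[t] depends only on x[<= t]."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class WaveNetStack(nn.Module):
    """Layers with dilation 1, 2, 4, ..., 128: the receptive field doubles
    per layer, reaching 256 samples with 8 layers and kernel size 2."""
    def __init__(self, channels=32, layers=8):
        super().__init__()
        self.filters = nn.ModuleList(
            CausalConv1d(channels, 2, 2 ** i) for i in range(layers))
        self.gates = nn.ModuleList(
            CausalConv1d(channels, 2, 2 ** i) for i in range(layers))

    def forward(self, x):
        for f, g in zip(self.filters, self.gates):
            residual = x
            x = torch.tanh(f(x)) * torch.sigmoid(g(x))  # gated activation unit
            x = x + residual  # residual connection
        return x

x = torch.randn(1, 32, 1000)    # (batch, channels, time)
print(WaveNetStack()(x).shape)  # torch.Size([1, 32, 1000])
```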
Since then, researchers have continued to refine and improve end-to-end models for TTS synthesis, using a variety of architectures such as autoregressive models, generative adversarial networks (GANs), and transformer networks. Today, end-to-end models are considered state-of-the-art in TTS synthesis and are widely used in applications such as virtual assistants, audiobooks, and automated voice response systems.
Some well-known end-to-end models for speech synthesis are:
- WaveNet: WaveNet is a deep neural network developed by researchers at DeepMind in 2016. It is based on a stack of dilated causal convolutions that directly generates raw audio samples, without the need for an intermediate signal-processing vocoder.
- Tacotron: Tacotron is an end-to-end generative model for TTS developed by researchers at Google in 2017. It uses a sequence-to-sequence architecture with an attention mechanism to generate a spectrogram directly from text, which is then converted to a waveform (originally with the Griffin-Lim algorithm).
- Deep Voice: Deep Voice is a family of neural TTS systems developed by researchers at Baidu in 2017. It includes three main models: Deep Voice 1 and 2 replaced each stage of the traditional pipeline with a neural network, while Deep Voice 3 moved to a fully convolutional, attention-based architecture.
- Char2Wav: Char2Wav is an end-to-end speech synthesis model developed by researchers at MILA (Université de Montréal) in 2017. It combines a sequence-to-sequence "reader" with attention and a SampleRNN-based neural vocoder to generate speech waveforms directly from a sequence of text characters.
- MelGAN: MelGAN is a generative adversarial network (GAN) based vocoder developed by researchers at Lyrebird AI and MILA in 2019. It is a non-autoregressive, fully convolutional model that converts mel-spectrograms into high-quality waveforms far faster than autoregressive vocoders (a simplified generator sketch appears after this list).
- FastSpeech 2s: A fully end-to-end TTS model that generates waveforms directly from text, using [[generative adversarial network (GAN)]] modeling and a CNN/self-attention architecture.
- VITS: A fully end-to-end TTS model using [[auto-codificador variacional (variational autoencoder, VAE)|variational autoencoder (VAE)]]+Flow modeling and a self-attention/CNN hybrid architecture.
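To give a flavor of the GAN-based vocoders above, here is a heavily simplified PyTorch sketch in the spirit of MelGAN's generator: a stack of transposed convolutions that upsamples mel-spectrogram frames 256× into waveform samples. The class name and layer sizes are my own assumptions; the real generator adds residual stacks of dilated convolutions and weight normalization, and the adversarial discriminators and training losses are omitted entirely.

```python
import torch
import torch.nn as nn

class TinyMelGANGenerator(nn.Module):
    """Hypothetical, heavily simplified MelGAN-style generator:
    upsamples mel frames to audio with transposed convolutions."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=7, padding=3),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),  # 8x upsample
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),   # 8x upsample
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2),     # 4x upsample
            nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):       # mel: (batch, n_mels, frames)
        return self.net(mel)      # (batch, 1, frames * 256)

mel = torch.randn(1, 80, 50)      # 50 fake mel frames
wave = TinyMelGANGenerator()(mel)
print(wave.shape)                 # torch.Size([1, 1, 12800]) -- 256 samples per frame
```

Because the whole stack is convolutional, generation is parallel across time, which is why vocoders in this family run orders of magnitude faster than sample-by-sample autoregressive models like WaveNet.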