Text-to-spectrogram models have distinct advantages over the acoustic models used in [[statistical parametric synthesis (SPSS)]].
Firstly, conventional acoustic models require explicit alignments between linguistic and acoustic features, whereas sequence-to-sequence neural models either learn the alignments implicitly through attention or predict phoneme durations jointly with the spectrogram. This makes them more end-to-end and reduces the need for preprocessing.
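As a toy illustration of attention-as-alignment, the NumPy sketch below (random vectors standing in for learned encoder and decoder states; all shapes are made up) shows how dot-product attention yields a soft alignment matrix, with each decoder step distributing probability mass over the input phonemes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 16))   # one vector per input phoneme
decoder_queries = rng.normal(size=(8, 16))  # one vector per output frame

# Dot-product scores, normalized per decoder step: a soft alignment.
alignment = softmax(decoder_queries @ encoder_states.T, axis=-1)
print(alignment.shape)         # (8, 6): decoder steps x encoder positions
print(alignment.sum(axis=-1))  # each row sums to 1
```

In a trained model these weights tend toward a monotonic diagonal, which is exactly the alignment that SPSS pipelines had to compute as a separate preprocessing step.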
Secondly, thanks to the increasing modeling power of neural networks, the linguistic features have been simplified to character or phoneme sequences, while the acoustic features have evolved from low-dimensional, condensed cepstral features to high-dimensional mel-spectrograms or even higher-dimensional linear spectrograms.
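To make those dimensions concrete, the librosa sketch below contrasts a linear spectrogram with an 80-band mel-spectrogram (the FFT size, hop length, and band count are illustrative choices, not taken from any particular system):

```python
import numpy as np
import librosa

# Load a mono waveform; librosa ships a short example clip.
y, sr = librosa.load(librosa.ex('trumpet'), sr=22050)

# Linear spectrogram: STFT magnitudes; n_fft=1024 gives 513 bins per frame.
linear = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Mel-spectrogram: the same energy warped onto 80 mel bands, a typical
# target dimensionality for neural text-to-spectrogram models.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

print(linear.shape)  # (513, n_frames) -- high-dimensional linear spectrogram
print(mel.shape)     # (80, n_frames)  -- condensed mel representation
```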
One approach to text-to-spectrogram modeling uses [[RNN-based models]] built on an encoder-attention-decoder framework. Examples include [[Tacotron architecture|Tacotron]] and [[Tacotron2 architecture|Tacotron 2]].
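The sketch below is a heavily simplified PyTorch version of this framework, using a hypothetical `TinySeq2Seq` module; real Tacotron models add prenets, CBHG/postnet modules, location-sensitive attention, and stop-token prediction:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySeq2Seq(nn.Module):
    """Minimal encoder-attention-decoder: phoneme IDs in, mel frames out."""
    def __init__(self, n_symbols=50, emb=64, hidden=64, n_mels=80):
        super().__init__()
        self.n_mels = n_mels
        self.embed = nn.Embedding(n_symbols, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder_cell = nn.GRUCell(n_mels + hidden, hidden)
        self.frame_proj = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids, n_frames):
        enc_out, _ = self.encoder(self.embed(phoneme_ids))  # (B, T_in, H)
        B = phoneme_ids.size(0)
        h = enc_out.new_zeros(B, enc_out.size(-1))
        frame = enc_out.new_zeros(B, self.n_mels)
        frames = []
        for _ in range(n_frames):  # one mel frame per iteration
            # Dot-product attention: a soft alignment over input phonemes.
            scores = torch.bmm(enc_out, h.unsqueeze(-1)).squeeze(-1)
            weights = F.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)
            # The next state depends on the previous frame: autoregressive.
            h = self.decoder_cell(torch.cat([frame, context], dim=-1), h)
            frame = self.frame_proj(h)
            frames.append(frame)
        return torch.stack(frames, dim=1)  # (B, n_frames, n_mels)

model = TinySeq2Seq()
mel = model(torch.randint(0, 50, (2, 12)), n_frames=30)
print(mel.shape)  # torch.Size([2, 30, 80])
```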
There are two issues with RNN-based encoder-attention-decoder models like Tacotron 2. Firstly, due to their recurrent nature, neither the encoder nor the decoder can be parallelized across time steps during training, and at inference the decoder must generate mel-spectrogram frames autoregressively, one at a time. This hurts both training and inference efficiency. Secondly, RNNs struggle to model the long-range dependencies in text and speech sequences, which are often lengthy.
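Both issues can be seen directly in code: an RNN layer must march through the time steps one by one because each hidden state depends on the previous one, while a self-attention layer relates all positions in one batched matrix multiplication, which also lets distant positions interact directly. A rough sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

T, d = 100, 64
x = torch.randn(1, T, d)

# RNN: hidden state t depends on hidden state t-1, so the T steps are
# computed sequentially even inside this single call.
rnn = nn.GRU(d, d, batch_first=True)
rnn_out, _ = rnn(x)

# Self-attention: all T positions are processed at once, and position 0
# attends to position 99 as easily as to its neighbor.
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
attn_out, _ = attn(x, x, x)

print(rnn_out.shape, attn_out.shape)  # both: (1, 100, 64)
```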
To address these issues, RNN-based models were replaced by [[transformer-based models]], which process all sequence positions in parallel through self-attention; an example is the [[FastSpeech architecture]].
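FastSpeech's key non-autoregressive ingredient is the length regulator: a duration predictor estimates how many mel frames each phoneme spans, and the encoder outputs are expanded accordingly so the decoder can produce every frame in parallel. A minimal single-utterance sketch of that expansion (the hidden size and durations below are made up; the real model predicts durations with a small network):

```python
import torch

def length_regulate(encoder_out, durations):
    # encoder_out: (T_in, d); durations: (T_in,) integer frame counts.
    # Repeat each phoneme's hidden vector for as many frames as it spans.
    return torch.repeat_interleave(encoder_out, durations, dim=0)

enc = torch.randn(4, 8)            # 4 phonemes, hidden size 8
dur = torch.tensor([2, 5, 1, 3])   # predicted frames per phoneme
expanded = length_regulate(enc, dur)
print(expanded.shape)              # torch.Size([11, 8]): sum of durations
```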