RNN-based models for speech synthesis are neural network models that use a [[recurrent neural network (RNN)]] to generate the [[acoustic features]] of speech.
There are several variations of RNN-based models for speech synthesis, including the traditional recurrent neural network (RNN), the long short-term memory (LSTM) network, and the gated recurrent unit (GRU). The LSTM and GRU variants address the vanishing-gradient problem of the traditional RNN and allow for better modeling of long-term dependencies.
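As a minimal sketch of what such an acoustic model can look like (assuming PyTorch; the class name, layer sizes, and feature dimensions are illustrative, not taken from any specific system), a gated RNN can map per-frame linguistic features to mel-spectrogram frames:

```python
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    """Hypothetical LSTM acoustic model: linguistic features -> mel frames."""
    def __init__(self, linguistic_dim=300, hidden_dim=256, mel_dim=80):
        super().__init__()
        # Gated (LSTM) cells mitigate vanishing gradients; a bidirectional
        # stack captures context from both past and future frames.
        self.rnn = nn.LSTM(linguistic_dim, hidden_dim, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, mel_dim)

    def forward(self, linguistic_features):
        # linguistic_features: (batch, time, linguistic_dim)
        hidden, _ = self.rnn(linguistic_features)
        return self.proj(hidden)  # (batch, time, mel_dim)

model = LSTMAcousticModel()
dummy_input = torch.randn(4, 120, 300)  # 4 utterances, 120 frames each
mel_frames = model(dummy_input)         # shape: (4, 120, 80)
```

Swapping `nn.LSTM` for `nn.GRU` yields the GRU variant with the same interface.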
One advantage of RNN-based models for speech synthesis is their ability to capture temporal dependencies and generate smooth and continuous speech. They can model complex dynamics and prosodic features in speech, such as pitch, duration, and intonation. Additionally, RNN-based models allow for sequential generation, enabling real-time synthesis and dynamic control of the generated speech.
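The sequential-generation property can be illustrated with a frame-by-frame autoregressive decoding loop (again a hedged sketch assuming PyTorch; the decoder structure and dimensions are hypothetical). Each predicted frame is fed back as input to the next step, which is what enables streaming, real-time-style synthesis:

```python
import torch
import torch.nn as nn

class AutoregressiveDecoder(nn.Module):
    """Hypothetical GRU decoder that emits one mel frame per recurrent step."""
    def __init__(self, mel_dim=80, hidden_dim=256):
        super().__init__()
        self.mel_dim = mel_dim
        self.hidden_dim = hidden_dim
        self.cell = nn.GRUCell(mel_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, mel_dim)

    def generate(self, num_frames):
        frame = torch.zeros(1, self.mel_dim)    # initial "go" frame
        state = torch.zeros(1, self.hidden_dim) # initial hidden state
        frames = []
        for _ in range(num_frames):
            state = self.cell(frame, state)  # one recurrent step
            frame = self.out(state)          # predict the next mel frame
            frames.append(frame)             # could be vocoded immediately
        return torch.stack(frames, dim=1)    # (1, num_frames, mel_dim)

decoder = AutoregressiveDecoder()
with torch.no_grad():
    mel = decoder.generate(num_frames=100)   # (1, 100, 80)
```

The same loop also makes the inference-speed limitation discussed below concrete: each frame depends on the previous one, so the steps cannot be parallelized across time.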
However, RNN-based models also have limitations. They can be computationally expensive to train, and their strictly sequential processing makes inference slower than with [[transformer-based models]], which process all time steps in parallel. Even the gated variants may struggle to model very long-range dependencies, and plain RNNs can suffer from vanishing or exploding gradients during training.
Examples of RNN-based models include the [[Tacotron architecture]] and the [[Tacotron2 architecture]].