Attention-based models: use an attention mechanism to learn the alignment between the input text and the output speech, letting the decoder focus on the relevant parts of the input sequence as it generates each output frame (a minimal sketch of one attention step follows the examples below).
Pros:
- No precomputed alignments needed (the alignment is learned jointly)
- Adaptable to diverse or noisy datasets
- Capable of more natural prosody
Cons:
- Attention can drift at inference time, causing skipped or repeated words
Examples:
- [[Tacotron2 architecture|Tacotron2]]
- [[Deep Voice 3]]
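To make the mechanism concrete, here is a minimal sketch of a single decoder step with additive (Bahdanau-style) attention in PyTorch. Tacotron2 actually uses location-sensitive attention, which additionally conditions on the cumulative attention weights from earlier steps, but the core alignment computation is the same; all names and tensor shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def attention_step(query, encoder_outputs, W_query, W_memory, v):
    """One decoder step of additive (Bahdanau) attention.

    query:           (batch, dec_dim)       current decoder hidden state
    encoder_outputs: (batch, T_in, enc_dim) encoded input text
    Returns the context vector and the alignment over input positions.
    """
    # Project decoder state and encoder memory into a shared space.
    q = W_query(query).unsqueeze(1)              # (batch, 1, attn_dim)
    k = W_memory(encoder_outputs)                # (batch, T_in, attn_dim)
    # Scalar energy per input position.
    energies = v(torch.tanh(q + k)).squeeze(-1)  # (batch, T_in)
    # Alignment: how much each input position matters for this frame.
    alignment = F.softmax(energies, dim=-1)      # (batch, T_in)
    # Context: weighted sum of encoder outputs, fed to the decoder.
    context = torch.bmm(alignment.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, alignment

# Illustrative usage with toy dimensions.
batch, T_in, enc_dim, dec_dim, attn_dim = 2, 12, 256, 512, 128
W_query = torch.nn.Linear(dec_dim, attn_dim, bias=False)
W_memory = torch.nn.Linear(enc_dim, attn_dim, bias=False)
v = torch.nn.Linear(attn_dim, 1, bias=False)
context, alignment = attention_step(
    torch.randn(batch, dec_dim), torch.randn(batch, T_in, enc_dim),
    W_query, W_memory, v)
print(context.shape, alignment.shape)  # torch.Size([2, 256]) torch.Size([2, 12])
```

The `alignment` tensor is what drifts in the failure cases noted above: if its mass stalls on one position or jumps backward between frames, the output repeats or skips words.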
Duration-based models: explicitly predict the duration of each phoneme (or other input unit) and use those durations to control the pacing and rhythm of the generated speech (see the length-regulation sketch after the examples below).
Pros:
- Fast parallel inference
- Less chance of alignment problems
- Easier to train if alignments are available
- More robust to silence in training data
Cons:
- Less flexible at capturing complex relationships between input and output
- Require a separate duration model (or an external aligner to supply duration targets)
Examples:
- [[FastSpeech architecture|FastSpeech]]
- [[FastSpeech2 architecture|FastSpeech2]]
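As a concrete illustration, here is a minimal sketch of the length-regulation step these models rely on (FastSpeech calls this component the length regulator): each phoneme-level encoder output is repeated according to its predicted duration in frames, so the expanded sequence matches the length of the target spectrogram. The function name and shapes are illustrative.

```python
import torch

def length_regulate(encoder_outputs, durations):
    """Expand phoneme-level features to frame level using durations.

    encoder_outputs: (T_in, enc_dim)  one feature vector per phoneme
    durations:       (T_in,)          predicted frames per phoneme (ints)
    Returns:         (sum(durations), enc_dim) frame-level features.
    """
    # repeat_interleave repeats row i durations[i] times, so the
    # output length equals the total number of speech frames.
    return torch.repeat_interleave(encoder_outputs, durations, dim=0)

# Illustrative usage: 4 phonemes expanded to 2+5+1+3 = 11 frames.
feats = torch.randn(4, 256)
durs = torch.tensor([2, 5, 1, 3])
frames = length_regulate(feats, durs)
print(frames.shape)  # torch.Size([11, 256])
```

Because the output length is fixed by the durations up front, the decoder never has to discover the alignment itself, which is why these models avoid attention drift and can generate all frames in parallel.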