Attention-based models: use an attention mechanism to learn the alignment between the input text and the output speech, letting the decoder focus on the relevant parts of the input sequence as it generates each output frame (a minimal sketch of one attention step follows the examples below).
Pros:
- No precomputed alignments needed (the alignment is learned jointly)
- Adaptable to diverse or noisy datasets
- Capable of more natural prosody
Cons:
- Attention can drift at inference time, causing skipped or repeated words
Examples:
- [[Tacotron2 architecture|Tacotron2]]
- [[Deep Voice 3]]
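To make the mechanism concrete, here is a minimal sketch of a single decoder step with additive (Bahdanau-style) attention in PyTorch. Tacotron2 actually uses location-sensitive attention, which additionally conditions on the cumulative attention weights from earlier steps, but the core alignment computation is the same; all names and tensor shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def attention_step(query, encoder_outputs, W_query, W_memory, v):
    """One decoder step of additive (Bahdanau) attention.

    query:           (batch, dec_dim)       current decoder hidden state
    encoder_outputs: (batch, T_in, enc_dim) encoded input text
    Returns the context vector and the alignment over input positions.
    """
    # Project decoder state and encoder memory into a shared space.
    q = W_query(query).unsqueeze(1)              # (batch, 1, attn_dim)
    k = W_memory(encoder_outputs)                # (batch, T_in, attn_dim)
    # Scalar energy per input position.
    energies = v(torch.tanh(q + k)).squeeze(-1)  # (batch, T_in)
    # Alignment: how much each input position matters for this frame.
    alignment = F.softmax(energies, dim=-1)      # (batch, T_in)
    # Context: weighted sum of encoder outputs, fed to the decoder.
    context = torch.bmm(alignment.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, alignment

# Illustrative usage with toy dimensions.
batch, T_in, enc_dim, dec_dim, attn_dim = 2, 12, 256, 512, 128
W_query = torch.nn.Linear(dec_dim, attn_dim, bias=False)
W_memory = torch.nn.Linear(enc_dim, attn_dim, bias=False)
v = torch.nn.Linear(attn_dim, 1, bias=False)
context, alignment = attention_step(
    torch.randn(batch, dec_dim), torch.randn(batch, T_in, enc_dim),
    W_query, W_memory, v)
print(context.shape, alignment.shape)  # torch.Size([2, 256]) torch.Size([2, 12])
```

The `alignment` tensor is what drifts in the failure cases noted above: if its mass stalls on one position or jumps backward between frames, the output repeats or skips words.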
Duration-based models: explicitly predict the duration of each phoneme (or other input unit) and use those durations to control the pacing and rhythm of the generated speech (see the length-regulation sketch after the examples below).
Pros:
- Fast parallel inference
- Less chance of alignment problems
- Easier to train if alignments are available
- More robust to silence in training data
Cons:
- Less flexible at capturing complex relationships between input and output
- Require a separate duration model (or an external aligner to supply duration targets)
Examples:
- [[FastSpeech architecture|FastSpeech]]
- [[FastSpeech2 architecture|FastSpeech2]]
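As a concrete illustration, here is a minimal sketch of the length-regulation step these models rely on (FastSpeech calls this component the length regulator): each phoneme-level encoder output is repeated according to its predicted duration in frames, so the expanded sequence matches the length of the target spectrogram. The function name and shapes are illustrative.

```python
import torch

def length_regulate(encoder_outputs, durations):
    """Expand phoneme-level features to frame level using durations.

    encoder_outputs: (T_in, enc_dim)  one feature vector per phoneme
    durations:       (T_in,)          predicted frames per phoneme (ints)
    Returns:         (sum(durations), enc_dim) frame-level features.
    """
    # repeat_interleave repeats row i durations[i] times, so the
    # output length equals the total number of speech frames.
    return torch.repeat_interleave(encoder_outputs, durations, dim=0)

# Illustrative usage: 4 phonemes expanded to 2+5+1+3 = 11 frames.
feats = torch.randn(4, 256)
durs = torch.tensor([2, 5, 1, 3])
frames = length_regulate(feats, durs)
print(frames.shape)  # torch.Size([11, 256])
```

Because the output length is fixed by the durations up front, the decoder never has to discover the alignment itself, which is why these models avoid attention drift and can generate all frames in parallel.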