The traditional text-to-speech (TTS) pipeline proceeds in the following sequence of steps:
- Training on a dataset of speech waveforms ($\mathcal{X}$) with transcriptions ($\mathcal{W}$):
    - Extract the [[linguistic features]]: $\hat{\mathcal{L}} = \underset{\mathcal{L}}{\arg \max} \;P(\mathcal{L}|\mathcal{W})$
    - Extract the [[acoustic features]]: $\hat{\mathcal{O}} = \underset{\mathcal{O}}{\arg \max} \;p(\mathcal{X}|\mathcal{O})$
    - Learn the [[acoustic model]]: $\hat{\lambda} = \underset{\lambda}{\arg \max} \; p(\hat{\mathcal{O}}|\hat{\mathcal{L}},\lambda) p(\lambda)$
- Synthesis of speech from text $\mathbf{w}$:
    - Predict the [[linguistic features]]: $\hat{\mathbf{l}} = \underset{\mathbf{l}}{\arg \max} \;P(\mathbf{l}|\mathbf{w})$
    - Predict the [[acoustic features]]: $\hat{\mathbf{o}} = \underset{\mathbf{o}}{\arg \max} \;p(\mathbf{o}|\hat{\mathbf{l}},\hat{\lambda})$
    - Synthesize the waveform by sampling: $\mathbf{x} \sim p(\mathbf{x}|\hat{\mathbf{o}})$
![[heiga-zen-step-maxim.png]]
[Heiga Zen 2017](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45882.pdf)
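
The training and synthesis steps above can be sketched as a toy Python pipeline. This is only an illustration under strong assumptions, not the method from the cited slides: every function name is hypothetical, the "linguistic features" are just vowel/consonant labels, the "acoustic features" are per-frame mean amplitudes, and the acoustic model is a per-label Gaussian mean fitted with a flat prior $p(\lambda)$, so the MAP estimate reduces to the sample mean.

```python
import random
from collections import defaultdict

FRAME_LEN = 4  # samples per frame (arbitrary toy value)

def linguistic_features(text):
    # Stand-in for argmax_l P(l | w): a deterministic text front end
    # mapping each letter to a coarse label (V = vowel, C = consonant).
    return ["V" if ch in "aeiou" else "C" for ch in text.lower() if ch.isalpha()]

def acoustic_features(waveform, frame_len):
    # Stand-in for argmax_O p(X | O): one feature per frame (mean amplitude).
    frames = [waveform[i:i + frame_len] for i in range(0, len(waveform), frame_len)]
    return [sum(f) / len(f) for f in frames]

def train_acoustic_model(labels, feats):
    # argmax_lambda p(O | L, lambda) p(lambda) for a per-label Gaussian mean
    # with a flat prior (assumption): the estimate is the per-label average.
    grouped = defaultdict(list)
    for lab, o in zip(labels, feats):
        grouped[lab].append(o)
    return {lab: sum(v) / len(v) for lab, v in grouped.items()}

def predict_acoustic(labels, model):
    # argmax_o p(o | l, lambda): the mode of a Gaussian is its mean.
    return [model[lab] for lab in labels]

def synthesize_waveform(feats, frame_len, noise_std, rng):
    # x ~ p(x | o): draw each sample from a Gaussian centred on the frame feature.
    return [rng.gauss(o, noise_std) for o in feats for _ in range(frame_len)]

# Training: one toy utterance, one frame per letter (made-up data).
train_labels = linguistic_features("banana")
train_wave = [0.25 if lab == "C" else 0.75
              for lab in train_labels for _ in range(FRAME_LEN)]
train_feats = acoustic_features(train_wave, FRAME_LEN)
model = train_acoustic_model(train_labels, train_feats)

# Synthesis: text -> linguistic features -> acoustic features -> waveform.
l_hat = linguistic_features("hi")
o_hat = predict_acoustic(l_hat, model)
x = synthesize_waveform(o_hat, FRAME_LEN, noise_std=0.01, rng=random.Random(0))
print(model, len(x))
```

Each stage is optimized or evaluated on its own and only passes a point estimate to the next stage, which mirrors the step-by-step maximization in the list. The final stage samples rather than maximizes, matching $\mathbf{x} \sim p(\mathbf{x}|\hat{\mathbf{o}})$.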