The traditional TTS pipeline consists of the following sequence of steps:

- Training on a dataset of speech waveforms ($\mathcal{X}$) with transcriptions ($\mathcal{W}$):
    - Extract the [[linguistic features]]: $\hat{\mathcal{L}} = \underset{\mathcal{L}}{\arg \max} \;P(\mathcal{L}|\mathcal{W})$
    - Extract the [[acoustic features]]: $\hat{\mathcal{O}} = \underset{\mathcal{O}}{\arg \max} \;p(\mathcal{X}|\mathcal{O})$
    - Learn the [[acoustic model]]: $\hat{\lambda} = \underset{\lambda}{\arg \max} \; p(\hat{\mathcal{O}}|\hat{\mathcal{L}},\lambda) p(\lambda)$
- Synthesis of a text $\mathbf{w}$:
    - Predict the [[linguistic features]]: $\hat{\mathbf{l}} = \underset{\mathbf{l}}{\arg \max} \;P(\mathbf{l}|\mathbf{w})$
    - Predict the [[acoustic features]]: $\hat{\mathbf{o}} = \underset{\mathbf{o}}{\arg \max} \;p(\mathbf{o}|\hat{\mathbf{l}},\hat{\lambda})$
    - Synthesize the waveform by sampling: $\mathbf{x} \sim p(\mathbf{x}|\hat{\mathbf{o}})$

Each step is maximized on its own, with the point estimate from one step fed into the next.

![[heiga-zen-step-maxim.png]]
[Heiga Zen 2017](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45882.pdf)
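
The step-wise structure above can be made concrete with a deliberately tiny toy sketch. Everything in it is an assumption chosen for illustration, not a real TTS front end or vocoder: the linguistic features are just lowercase letters, each letter is assumed to align with exactly one fixed-length frame, each frame is modelled as zero-mean Gaussian noise whose standard deviation is the acoustic feature, and the acoustic model $\lambda$ is a per-letter mean with a zero-mean Gaussian prior. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict

FRAME_LEN = 2  # samples per frame (toy value, assumed for this sketch)

def linguistic_features(text):
    # Stand-in for argmax_l P(l | w): a deterministic front end that keeps letters only.
    return [ch for ch in text.lower() if ch.isalpha()]

def acoustic_features(waveform, frame_len=FRAME_LEN):
    # Stand-in for argmax_O p(X | O): if a frame is zero-mean Gaussian noise with
    # standard deviation o, the maximum-likelihood o is the RMS of that frame.
    frames = [waveform[i:i + frame_len] for i in range(0, len(waveform), frame_len)]
    return [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]

def train_acoustic_model(corpus, alpha=1.0):
    # Stand-in for argmax_lambda p(O | L, lambda) p(lambda): with unit-variance
    # Gaussian likelihood around a per-letter mean and a zero-mean Gaussian prior
    # of precision alpha, the MAP mean is sum(o) / (count + alpha).
    sums, counts = defaultdict(float), defaultdict(int)
    for text, waveform in corpus:
        L_hat = linguistic_features(text)
        O_hat = acoustic_features(waveform)
        for l, o in zip(L_hat, O_hat):  # assumes one frame per letter
            sums[l] += o
            counts[l] += 1
    return {l: sums[l] / (counts[l] + alpha) for l in sums}

def synthesize(text, lam, rng, frame_len=FRAME_LEN):
    l_hat = linguistic_features(text)
    # argmax_o p(o | l, lambda) is the Gaussian mean; unseen letters fall back
    # to the prior mean 0.0, which yields silence.
    o_hat = [lam.get(l, 0.0) for l in l_hat]
    # x ~ p(x | o): draw each sample from a zero-mean Gaussian with std o.
    return [rng.gauss(0.0, o) for o in o_hat for _ in range(frame_len)]

# Two toy utterances: (transcription, waveform samples).
corpus = [("ab", [1.0, -1.0, 3.0, -3.0]), ("ba", [3.0, 3.0, -1.0, 1.0])]
lam = train_acoustic_model(corpus, alpha=1.0)
x = synthesize("ab", lam, random.Random(0))
```

Note that the last step draws a sample rather than taking an argmax, which is why repeated calls with different random seeds give different waveforms for the same text. The prior term shrinks each per-letter estimate toward zero; setting `alpha=0.0` would reduce training to plain maximum likelihood.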