In speech synthesis, the acoustic model is a component of a [[text-to-speech synthesis (TTS)|TTS]] system that generates acoustic features from the text input. The acoustic model is trained on large datasets of recorded speech and their corresponding transcriptions.
Treating the acoustic model as an explicit random variable allows the training and inference processes to be separated.
Using the [[probabilistic formulation of TTS]], the acoustic model $\lambda$ can be included in the predictive distribution of the synthesized waveform $\mathbf{x}$, given the text $\mathbf{w}$ and the dataset of speech waveforms and transcriptions ($\mathcal{X}, \mathcal{W}$):
$$
p(\mathbf{x}|\mathbf{w},\mathcal{X},\mathcal{W}) = \int p(\mathbf{x}, \lambda|\mathbf{w},\mathcal{X},\mathcal{W}) \, d\lambda
$$
Assuming that, given the model $\lambda$, the waveform is conditionally independent of the training data, this decomposes into:
$$
p(\mathbf{x}|\mathbf{w},\mathcal{X},\mathcal{W}) = \int \underbrace{p(\mathbf{x}|\mathbf{w}, \lambda)}_{\text{inference}} \underbrace{p(\lambda|\mathcal{X},\mathcal{W})}_{\text{training}}
\, d\lambda
$$
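In practice the integral over $\lambda$ is rarely computed exactly; a common approximation (a sketch, not the only option) replaces the posterior over models with a single point estimate $\hat{\lambda}$, typically obtained by maximum likelihood or MAP training:

$$
\hat{\lambda} = \arg\max_{\lambda} \, p(\lambda|\mathcal{X},\mathcal{W}),
\qquad
p(\mathbf{x}|\mathbf{w},\mathcal{X},\mathcal{W}) \approx p(\mathbf{x}|\mathbf{w},\hat{\lambda})
$$

Under this approximation, training reduces to fitting $\hat{\lambda}$ once on $(\mathcal{X}, \mathcal{W})$, and inference reduces to sampling or maximizing $p(\mathbf{x}|\mathbf{w},\hat{\lambda})$ for each new text $\mathbf{w}$, which is what makes the separation between the two processes practical.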