VAE Tacotron is a variant of the [[Tacotron2 architecture]] that uses a [[variational autoencoder (VAE)]] to encode prosody-related features into the [[latent space]] at each input token (e.g. phoneme). The VAE generative framework makes it possible to sample different prosody features for each phoneme at inference time, providing fine-grained control over the prosody of the synthesized speech. The prior over each latent variable is commonly modeled as a standard Gaussian distribution.

![[vae-ls-training.png]]
[Alex Barron 2022](http://web.stanford.edu/class/cs224s/lectures/224s.22.lec16.pdf)

![[vae-ls-inference.png]]
[Alex Barron 2022](http://web.stanford.edu/class/cs224s/lectures/224s.22.lec16.pdf)

However, since the prior is independent across phonemes, the generated audio often exhibits unnatural artifacts such as long pauses between syllables or sudden increases in energy or fundamental frequency. Quantizing the latent representation makes it possible to train an autoregressive prior network that models the temporal dynamics across the latent features. This improves naturalness while still ensuring reasonable diversity across samples (minimal sketches of both ideas appear at the end of this note).

![[sun2020-vae-encoder.png]]
[Sun et al. (2020)](https://arxiv.org/pdf/2002.03788)

![[sun2020-vae-vector-quant.png]]
[Sun et al. (2020)](https://arxiv.org/pdf/2002.03788)

## Reference

Sun, Guangzhi, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, and Yonghui Wu. "Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and autoregressive prosody prior." In _ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)_, pp. 6699-6703. IEEE, 2020. [PDF](https://arxiv.org/pdf/2002.03788)
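## Sketches

The following is a minimal PyTorch-style sketch of the fine-grained VAE idea: a per-phoneme posterior with the reparameterization trick during training, a KL term against the standard Gaussian prior, and independent per-phoneme sampling from that prior at inference. All class and variable names (`FineGrainedVAE`, `ref_feats`, the layer sizes) are hypothetical illustrations, not the architecture from the papers above.

```python
import torch
import torch.nn as nn


class FineGrainedVAE(nn.Module):
    """Per-phoneme prosody latent with a standard Gaussian prior (illustrative sketch)."""

    def __init__(self, enc_dim=512, latent_dim=16):
        super().__init__()
        # Posterior network: maps reference-audio features (aligned to phonemes)
        # to the mean and log-variance of q(z | x).
        self.to_mu = nn.Linear(enc_dim, latent_dim)
        self.to_logvar = nn.Linear(enc_dim, latent_dim)

    def forward(self, ref_feats):
        # ref_feats: (batch, n_phonemes, enc_dim), one feature vector per phoneme.
        mu = self.to_mu(ref_feats)
        logvar = self.to_logvar(ref_feats)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL( q(z|x) = N(mu, sigma^2) || p(z) = N(0, I) ), summed over latent
        # dims and averaged over phonemes and batch; added to the decoder loss.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return z, kl  # z is concatenated with the text encoder output downstream

    @torch.no_grad()
    def sample_prior(self, batch, n_phonemes, latent_dim=16):
        # Inference: each phoneme's latent is drawn independently from N(0, I),
        # which is exactly what causes the temporal-consistency artifacts above.
        return torch.randn(batch, n_phonemes, latent_dim)
```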
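And a second sketch of the quantized latent space with an autoregressive prior, in the spirit of Sun et al. (2020) but again with hypothetical names and sizes, and with the VQ commitment/codebook losses omitted for brevity. Training snaps each per-phoneme latent to its nearest codebook entry (with a straight-through gradient) and teacher-forces an LSTM prior over the code indices; inference samples the code sequence ancestrally, so consecutive phonemes get temporally coherent prosody.

```python
import torch
import torch.nn as nn


class QuantizedProsodyPrior(nn.Module):
    """VQ codebook over per-phoneme latents plus an LSTM prior (illustrative sketch)."""

    def __init__(self, latent_dim=16, codebook_size=256, hidden=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Autoregressive prior over the discrete code indices.
        self.lstm = nn.LSTM(latent_dim, hidden, batch_first=True)
        self.logits = nn.Linear(hidden, codebook_size)

    def quantize(self, z):
        # z: (batch, n_phonemes, latent_dim) -> nearest codebook entry per phoneme.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dists.argmin(-1)                      # (batch, n_phonemes)
        zq = self.codebook(idx)
        # Straight-through estimator: forward pass uses zq, gradients flow to z.
        zq = z + (zq - z).detach()
        return zq, idx

    def prior_nll(self, idx):
        # Teacher-forced training of the prior: predict code t from codes < t.
        emb = self.codebook(idx)
        inp = torch.cat([torch.zeros_like(emb[:, :1]), emb[:, :-1]], dim=1)
        out, _ = self.lstm(inp)
        return nn.functional.cross_entropy(self.logits(out).transpose(1, 2), idx)

    @torch.no_grad()
    def sample(self, batch, n_phonemes):
        # Inference: ancestral sampling of one code per phoneme, then embedding.
        codes, state = [], None
        inp = torch.zeros(batch, 1, self.codebook.embedding_dim)
        for _ in range(n_phonemes):
            out, state = self.lstm(inp, state)
            i = torch.distributions.Categorical(logits=self.logits(out[:, -1])).sample()
            codes.append(i)
            inp = self.codebook(i).unsqueeze(1)
        return self.codebook(torch.stack(codes, dim=1))
```

The design point the sketch is meant to surface: because the sampled code at step `t` conditions the distribution at step `t + 1`, the prior can no longer produce the independent per-phoneme jumps that the standard Gaussian prior allows, while sampling from the categorical distribution still yields diverse prosody across runs.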