VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end TTS model (the text encoder and the vocoder are trained together) that takes advantage of state-of-the-art deep learning techniques such as [[generative adversarial network (GAN)]]s, [[variational autoencoder (VAE)]]s, and [[normalizing flows]].
VITS does not require external alignment annotations; it learns the text-to-audio alignment itself using [[Monotonic Alignment Search (MAS)]], as explained in the paper. The architecture combines the [[GlowTTS architecture]] encoder with the [[HiFiGAN vocoder]]. It is a feed-forward model with a 67.12× real-time factor on a GPU.
VITS is the backbone architecture of the [[YourTTS architecture|YourTTS]] multi-speaker and multi-lingual TTS model.
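To illustrate MAS, here is a minimal NumPy sketch of the underlying dynamic program (the function name and array shapes are illustrative assumptions; the official implementation is a vectorized Cython kernel). Given the log-likelihood of each latent frame under the prior predicted for each text token, it returns the most likely hard monotonic alignment:

```python
import numpy as np

def monotonic_alignment_search(log_probs: np.ndarray) -> np.ndarray:
    """Find the most likely monotonic alignment between text tokens and frames.

    log_probs: [T_text, T_frames] matrix; entry (i, j) is the log-likelihood of
    frame j under the prior distribution predicted for text token i.
    Returns a hard 0/1 alignment matrix of the same shape.
    """
    T_text, T_frames = log_probs.shape
    assert T_frames >= T_text, "every token needs at least one frame"

    # Q[i, j] = best cumulative log-likelihood of a monotonic alignment
    # that assigns frame j to token i.
    Q = np.full((T_text, T_frames), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, T_frames):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                                    # keep the same token
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf       # move to the next token
            Q[i, j] = max(stay, advance) + log_probs[i, j]

    # Backtrack from the last token/frame to recover the hard alignment.
    alignment = np.zeros((T_text, T_frames), dtype=np.int64)
    i = T_text - 1
    for j in range(T_frames - 1, -1, -1):
        alignment[i, j] = 1
        if i > 0 and (j == i or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    return alignment
```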
![[vits-architecture.png]]
[Kim et al. (2021)](https://arxiv.org/pdf/2106.06103)
VITS can be used in single-speaker and multi-speaker settings.
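For example, pretrained VITS checkpoints can be run through the Coqui TTS Python API — a sketch assuming the `TTS` package is installed and that the LJSpeech (single-speaker) and VCTK (multi-speaker) VITS checkpoints are available under the model names below:

```python
from TTS.api import TTS

# Single-speaker VITS trained on LJSpeech.
tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(text="VITS is an end-to-end TTS model.", file_path="ljspeech_vits.wav")

# Multi-speaker VITS trained on VCTK: pick one of the listed speaker IDs.
tts_ms = TTS(model_name="tts_models/en/vctk/vits")
print(tts_ms.speakers)  # available speaker IDs
tts_ms.tts_to_file(
    text="The same architecture also supports multiple speakers.",
    speaker=tts_ms.speakers[0],
    file_path="vctk_vits.wav",
)
```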
The conditional VAE formulation of VITS maximizes the variational lower bound (ELBO) of the marginal log-likelihood:
$$
\log p_{\theta}(x|c) \ge \mathbb{E}_{q_{\phi}(z|x)} \left[ \log p_{\theta}(x|z) - \log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c)} \right]
$$
where:
- $x$ is the target speech (a linear spectrogram is fed to the posterior encoder in practice)
- $z$ is the [[latent space]] vector
- $c$ is the input phoneme sequence
- $q_{\phi}(z|x)$ is the approximate posterior over $z$
- $p_{\theta}(z|c)$ is the prior over $z$ conditioned on the text
- $\log p_{\theta}(x|c)$ is the marginal log-likelihood of the data
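The term subtracted inside the expectation is, in expectation, the KL divergence between the approximate posterior and the text-conditioned prior, so maximizing the ELBO is equivalent to minimizing a reconstruction loss plus a KL loss; in the paper the reconstruction term is realized as an L1 distance between the target and predicted mel-spectrograms:

$$
L_{vae} = \underbrace{-\,\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]}_{\approx\, L_{recon} \,=\, \lVert x_{mel} - \hat{x}_{mel} \rVert_1} + \underbrace{\mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c)}\right]}_{L_{kl} \,=\, D_{KL}\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|c)\right)}
$$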
## Reference
Kim, Jaehyeon, Jungil Kong, and Juhee Son. ‘Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech’. arXiv, 10 June 2021. [https://doi.org/10.48550/arXiv.2106.06103](https://doi.org/10.48550/arXiv.2106.06103).