VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end TTS model (the text encoder and the vocoder are trained together) that takes advantage of state-of-the-art deep learning techniques such as [[generative adversarial network (GAN)]]s, [[variational autoencoder (VAE)]]s, and [[normalizing flows]].
VITS does not require external alignment annotations; it learns the text-to-audio alignment itself using [[Monotonic Alignment Search (MAS)]], as explained in the paper. The architecture combines the [[GlowTTS architecture]] encoder with the [[HiFiGAN vocoder]]. It is a feed-forward model with a 67.12× real-time factor on a GPU.
VITS is the backbone architecture of the [[YourTTS architecture|YourTTS]] multi-speaker and multi-lingual TTS model.
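To illustrate MAS, here is a minimal NumPy sketch of the underlying dynamic program (the function name and array shapes are illustrative assumptions; the official implementation is a vectorized Cython kernel). Given the log-likelihood of each latent frame under the prior predicted for each text token, it returns the most likely hard monotonic alignment:

```python
import numpy as np

def monotonic_alignment_search(log_probs: np.ndarray) -> np.ndarray:
    """Find the most likely monotonic alignment between text tokens and frames.

    log_probs: [T_text, T_frames] matrix; entry (i, j) is the log-likelihood of
    frame j under the prior distribution predicted for text token i.
    Returns a hard 0/1 alignment matrix of the same shape.
    """
    T_text, T_frames = log_probs.shape
    assert T_frames >= T_text, "every token needs at least one frame"

    # Q[i, j] = best cumulative log-likelihood of a monotonic alignment
    # that assigns frame j to token i.
    Q = np.full((T_text, T_frames), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, T_frames):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                                    # keep the same token
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf       # move to the next token
            Q[i, j] = max(stay, advance) + log_probs[i, j]

    # Backtrack from the last token/frame to recover the hard alignment.
    alignment = np.zeros((T_text, T_frames), dtype=np.int64)
    i = T_text - 1
    for j in range(T_frames - 1, -1, -1):
        alignment[i, j] = 1
        if i > 0 and (j == i or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    return alignment
```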
![[vits-architecture.png]]
[Kim et al. (2021)](https://arxiv.org/pdf/2106.06103)
VITS can be used in single-speaker and multi-speaker settings.
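For example, pretrained VITS checkpoints can be run through the Coqui TTS Python API — a sketch assuming the `TTS` package is installed and that the LJSpeech (single-speaker) and VCTK (multi-speaker) VITS checkpoints are available under the model names below:

```python
from TTS.api import TTS

# Single-speaker VITS trained on LJSpeech.
tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(text="VITS is an end-to-end TTS model.", file_path="ljspeech_vits.wav")

# Multi-speaker VITS trained on VCTK: pick one of the listed speaker IDs.
tts_ms = TTS(model_name="tts_models/en/vctk/vits")
print(tts_ms.speakers)  # available speaker IDs
tts_ms.tts_to_file(
    text="The same architecture also supports multiple speakers.",
    speaker=tts_ms.speakers[0],
    file_path="vctk_vits.wav",
)
```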
The conditional VAE formulation of VITS maximizes the variational lower bound (ELBO) of the marginal log-likelihood:
$$
\log p_{\theta}(x|c) \ge \mathbb{E}_{q_{\phi}(z|x)} \left[ \log p_{\theta}(x|z) - \log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c)} \right]
$$
where:
- $x$ is the target speech (a linear spectrogram is fed to the posterior encoder in practice)
- $z$ is the [[latent space]] vector
- $c$ is the input phoneme sequence
- $q_{\phi}(z|x)$ is the approximate posterior over $z$
- $p_{\theta}(z|c)$ is the prior over $z$ conditioned on the text
- $\log p_{\theta}(x|c)$ is the marginal log-likelihood of the data
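The term subtracted inside the expectation is, in expectation, the KL divergence between the approximate posterior and the text-conditioned prior, so maximizing the ELBO is equivalent to minimizing a reconstruction loss plus a KL loss; in the paper the reconstruction term is realized as an L1 distance between the target and predicted mel-spectrograms:

$$
L_{vae} = \underbrace{-\,\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]}_{\approx\, L_{recon} \,=\, \lVert x_{mel} - \hat{x}_{mel} \rVert_1} + \underbrace{\mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{q_{\phi}(z|x)}{p_{\theta}(z|c)}\right]}_{L_{kl} \,=\, D_{KL}\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|c)\right)}
$$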
## Reference
Kim, Jaehyeon, Jungil Kong, and Juhee Son. ‘Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech’. arXiv, 10 June 2021. [https://doi.org/10.48550/arXiv.2106.06103](https://doi.org/10.48550/arXiv.2106.06103).