In January 2023 a team of Microsoft researchers released the paper [Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/abs/2301.02111) introducing a language modeling approach for [[text-to-speech synthesis (TTS)]].
Previous [[end-to-end speech synthesis]] models that mix speech and text, such as [[VITS architecture|VITS]] and [[YourTTS architecture|YourTTS]], require complex architectures that explicitly address issues such as alignment, speaker identity, and language. This complexity makes them difficult to train.
The approach proposed in the VALL-E architecture is instead to leverage the simplicity of generative language models and apply it to speech generation.
The input to VALL-E is phonemized text, and the output is the corresponding waveform. In addition, VALL-E uses a prompting mechanism: a 3-second audio sample is fed to the model as an acoustic prompt, and the speech generated for the input text is conditioned on that prompt. In practice, this enables zero-shot speech generation, i.e. producing speech in a voice that was not seen in the training data.
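Phonemization itself is not specific to VALL-E; as a quick illustration, the open-source phonemizer package (my choice of tool for this sketch, not one prescribed by the paper) can produce this kind of input:

```python
# pip install phonemizer   (needs the espeak-ng backend installed on the system)
from phonemizer import phonemize

text = "Speech synthesis with a neural codec language model."
phones = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phones)  # a phone string, later mapped to indices into the phone vocabulary
```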
The paper presents the VALL-E architecture in the following high-level diagram:
![[VALL-E architecture.png]]
VALL-E architecture taken from the original paper
The pipeline is as follows:
1. The input text is converted into phones (indices into the phone vocabulary).
2. A phone embedding layer takes the vector of indices and outputs a matrix of embeddings corresponding to the input indices.
3. The 3-second acoustic prompt is fed into a pre-trained audio codec encoder (from the [[Encodec model]]).
4. The encoder outputs a discrete audio representation by splitting the audio into fixed time windows and assigning each window codes from a known vocabulary of audio embeddings (see the sketch after this list).
5. The model receives these two inputs and acts as an autoregressive language model to output the next discrete audio representation.
6. The predicted audio representations are transformed back into a waveform using the decoder of the [[Encodec model]].
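A minimal sketch of steps 2-4 and 6, assuming PyTorch and the open-source `encodec` pip package released with the EnCodec paper; the vocabulary size, embedding width, bandwidth, and file name are illustrative choices rather than the paper's exact configuration:

```python
# pip install torch torchaudio encodec
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Step 2: a phone embedding layer maps phone indices to embedding vectors.
phone_vocab_size, embed_dim = 100, 256           # illustrative sizes
phone_embedding = torch.nn.Embedding(phone_vocab_size, embed_dim)
phone_ids = torch.tensor([[12, 7, 33, 7, 54]])   # toy phone indices for one utterance
phone_matrix = phone_embedding(phone_ids)        # [1, 5, 256]

# Steps 3-4: encode the 3-second acoustic prompt into discrete codes.
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)                  # 8 residual codebooks at 6 kbps

wav, sr = torchaudio.load("prompt_3s.wav")       # any short enrolment clip
wav = convert_audio(wav, sr, codec.sample_rate, codec.channels).unsqueeze(0)

with torch.no_grad():
    frames = codec.encode(wav)                   # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)  # [1, 8, T] integer codes

# Step 6: the same codec turns (predicted) codes back into a waveform.
with torch.no_grad():
    reconstruction = codec.decode(frames)        # [1, 1, samples]
```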
The [[Encodec model]] converts audio into discrete codes using an encoder-decoder architecture with residual vector quantization. As a result, VALL-E operates much like a language model: given a prompt consisting of phonemized text and audio codes, it predicts the next discrete audio token.
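To make the residual vector quantization idea concrete, here is a toy sketch (not EnCodec's actual implementation): each codebook quantizes the residual left over by the previous one, so every frame ends up represented by a short stack of integer codes.

```python
import torch

def rvq_encode(x, codebooks):
    # Each codebook quantizes what the previous stages left over, so
    # codes[0] is a coarse approximation and later codes refine it.
    codes, residual = [], x
    for cb in codebooks:                       # cb: [codebook_size, dim]
        dists = torch.cdist(residual, cb)      # [num_frames, codebook_size]
        idx = dists.argmin(dim=-1)             # nearest codebook entry per frame
        codes.append(idx)
        residual = residual - cb[idx]          # pass the leftover to the next stage
    return torch.stack(codes)                  # [n_q, num_frames] integer codes

def rvq_decode(codes, codebooks):
    # Sum the selected entries from every stage to rebuild the frame vectors.
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

torch.manual_seed(0)
dim, n_q, codebook_size = 8, 4, 16
codebooks = [torch.randn(codebook_size, dim) for _ in range(n_q)]
frames = torch.randn(5, dim)                   # five toy "audio frames"

codes = rvq_encode(frames, codebooks)          # the discrete tokens a codec LM predicts
approx = rvq_decode(codes, codebooks)
print(codes.shape, torch.norm(frames - approx))
```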
The prediction is done by the [[VALL-E Neural Codec Language Model]].
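Below is a heavily simplified, purely illustrative sketch of the "predict the next acoustic token" idea: a tiny causal Transformer over a single flat stream of phone tokens followed by acoustic codes. The sizes are made up, and the actual design (covered in [[VALL-E Neural Codec Language Model]]) differs in important ways.

```python
import torch
import torch.nn as nn

# Toy decoder-only LM over a shared vocabulary of phone tokens and acoustic codes.
PHONE_VOCAB, AUDIO_VOCAB, DIM = 100, 1024, 256
VOCAB = PHONE_VOCAB + AUDIO_VOCAB                 # audio codes offset by PHONE_VOCAB

class TinyCodecLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, AUDIO_VOCAB)   # logits over acoustic codes only

    def forward(self, tokens):                    # tokens: [B, T]
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h[:, -1])                # logits for the next audio code

phones = torch.randint(0, PHONE_VOCAB, (1, 12))             # phonemized text
prompt_codes = torch.randint(PHONE_VOCAB, VOCAB, (1, 225))  # ~3 s of codes at 75 frames/s
logits = TinyCodecLM()(torch.cat([phones, prompt_codes], dim=1))
next_code = logits.argmax(dim=-1)                           # greedy pick of the next token
```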
An open-source implementation of VALL-E X is available at [GitHub - Plachtaa/VALL-E-X: An open source implementation of Microsoft's VALL-E X zero-shot TTS model](https://github.com/Plachtaa/VALL-E-X).
## References
- [\[2301.02111\] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/abs/2301.02111)
- [\[2210.13438\] High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- [VALL-E — The Future of Text to Speech? | by Elad Rapaport | Towards Data Science](https://towardsdatascience.com/vall-e-the-future-of-text-to-speech-d090b6ede07b)