The Neural Codec Language Model of VALL-E contains two transformer models:

- an autoregressive (AR) transformer that attends only to past data;
- a non-autoregressive (NAR) transformer that attends to all points in time.

The two transformers are shown in the following diagram:

![[VALL-E Neural Codec Language Model.png]]

AR and NAR models taken from the original paper

The output of the first quantizer of the [[Encodec model]] is predicted by the AR model according to Equation 1. In this equation, the output of the first quantizer is conditioned on the phoneme sequence of the input text, on the acoustic prompt, and on the first quantizer's outputs at previous timesteps (AR model):

![[VALL-E Neural Codec Language Model 1.png]]

Where:

- $C$ is the matrix of discrete audio codes.
- $\tilde{C}$ is the encoding of the acoustic prompt.
- $x$ is the encoding of the input text as phonemes.
- $c_{:,1}$ is the output of the first quantizer.

Equation 2 is the NAR model, where the output of each quantizer is conditioned on all of the timesteps of the previous quantizers (NAR model):

![[VALL-E Neural Codec Language Model 2.png]]

The AR transformer is used to predict only $c_{:,1}$, the tokens of the first quantizer. While doing so, it attends to the previous tokens it has generated. The NAR transformer attends to the previous quantizers at all timesteps, but not autoregressively to the previous timesteps of the current quantizer (those tokens are not available in the NAR model). A minimal sketch of this two-stage decoding is included after the references.

## References

- [\[2301.02111\] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/abs/2301.02111)
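## Decoding sketch

To make the split between the two stages concrete, here is a minimal Python sketch of the decoding procedure described above. It is a conceptual illustration only: the `ar_model` and `nar_model` callables, the `EOS_TOKEN` id, and the tensor shapes are assumptions, not the paper's implementation.

```python
import torch

# Conceptual sketch of VALL-E's two-stage decoding, assuming hypothetical
# `ar_model` / `nar_model` callables; not the paper's actual code.

NUM_QUANTIZERS = 8   # Encodec codebooks used in the paper
EOS_TOKEN = 1024     # assumed end-of-sequence id for the first codebook


def decode(ar_model, nar_model, phonemes, prompt_codes):
    """phonemes: (L,) phoneme ids; prompt_codes: (T_prompt, 8) Encodec codes
    of the acoustic prompt. Returns (T, 8) codes for the target speech."""
    # Stage 1: the AR transformer predicts the first-quantizer tokens c_{:,1}
    # one timestep at a time, attending only to the tokens it has generated.
    context = prompt_codes[:, 0]                       # prompt's first codebook
    first_codes = []
    while True:
        logits = ar_model(phonemes, context)           # distribution over the next token
        token = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        if token == EOS_TOKEN:
            break
        first_codes.append(token)
        context = torch.cat([context, torch.tensor([token], dtype=context.dtype)])

    codes = [torch.tensor(first_codes)]                # c_{:,1}

    # Stage 2: the NAR transformer fills in quantizers 2..8. For quantizer j it
    # predicts all timesteps in parallel, conditioned on the phonemes, the full
    # acoustic prompt, and every previously predicted quantizer (< j), but never
    # on other timesteps of quantizer j itself.
    for j in range(1, NUM_QUANTIZERS):
        logits = nar_model(phonemes, prompt_codes, torch.stack(codes, dim=-1), j)
        codes.append(logits.argmax(dim=-1))            # greedy choice per timestep

    return torch.stack(codes, dim=-1)                  # (T, 8), ready for Encodec decoding
```

The point the sketch illustrates is that only stage 1 runs a timestep-by-timestep loop; stage 2 iterates over quantizers, producing every timestep of a codebook in a single pass.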