Residual vector quantization (RVQ) is a technique proposed in the paper [SoundStream: An End-to-End Neural Audio Codec](https://arxiv.org/abs/2107.03312) and used in the [[Encodec model]]. RVQ quantizes an input vector in several stages, where each stage projects the remaining residual onto the closest entry in a codebook of a given size. For example, the [[VALL-E architecture]] uses codebooks with 1024 entries, where each entry is a vector of size 128.

The quantization process works as follows (a minimal code sketch of this loop is included after the references):

1. The input vector is mapped to the closest vector in the first codebook (by Euclidean distance, for example). Once this mapping is done, the input vector can be represented simply by the index of that vector in the codebook ($C_{1,1}$ in the figure below).
2. A residual vector is computed as the difference between the input vector and the selected codebook vector.
3. The residual vector is then quantized with a different 1024-vector codebook (e.g. $C_{2,3}$), and a new residual vector is computed.
4. The process is repeated 8 times.
5. The final RVQ representation is the sequence of indices matched in each of the codebooks.

An example of a three-step RVQ quantization:

![[residual vector quantization (RVQ).png]]

Taken from: [VALL-E — The Future of Text to Speech? | by Elad Rapaport | Towards Data Science](https://towardsdatascience.com/vall-e-the-future-of-text-to-speech-d090b6ede07b)

This encoding method is extremely efficient: with eight 1024-entry codebooks, $1024^{8} \approx 1.2 \times 10^{24}$ different vectors can be represented, while the codebooks together contain only $1024 \times 8 = 8192$ entries and each input vector is encoded as just 8 indices.

## References

- [\[2107.03312\] SoundStream: An End-to-End Neural Audio Codec](https://arxiv.org/abs/2107.03312)
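To make the loop above concrete, here is a minimal NumPy sketch of RVQ encoding and decoding. It is not the SoundStream/Encodec implementation (there the codebooks are learned jointly with the encoder and lookups run on batches of frames); the function names `rvq_encode`/`rvq_decode`, the random codebooks, and the toy input are illustrative assumptions, with sizes chosen to match the numbers above.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Encode x as one index per codebook by quantizing successive residuals."""
    indices = []
    residual = x.copy()
    for codebook in codebooks:                        # codebook shape: (entries, dim)
        # Pick the entry closest to the current residual (Euclidean distance).
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        # The next stage quantizes whatever this stage failed to capture.
        residual = residual - codebook[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct an approximation of x by summing the selected entries."""
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))

# Toy setup: 8 codebooks, 1024 entries each, dimension 128 (real codebooks are
# learned during training; random ones are used here only to show the mechanics).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(8)]
x = rng.normal(size=128)

codes = rvq_encode(x, codebooks)          # 8 integers in [0, 1023]
x_hat = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(x - x_hat))   # indices and reconstruction error
```

Each extra codebook refines the approximation of the input, which is why the decoder simply sums one selected entry per stage.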
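As a quick check on the efficiency claim, each 1024-entry index costs $\log_2 1024 = 10$ bits, so one 128-dimensional frame is encoded with $8 \times 10 = 80$ bits instead of, for example, $128 \times 32 = 4096$ bits for raw 32-bit floats. The frame rate below (75 frames per second, roughly what Encodec produces on 24 kHz audio) is an assumption used only to illustrate the resulting bitrate.

```python
import math

num_codebooks = 8
codebook_size = 1024
bits_per_index = int(math.log2(codebook_size))    # 10 bits per index

bits_per_frame = num_codebooks * bits_per_index   # 80 bits per quantized vector
frame_rate_hz = 75                                # assumed frame rate (illustrative)

print(bits_per_frame)                             # 80
print(bits_per_frame * frame_rate_hz / 1000)      # 6.0 kbit/s
```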