The Encodec model was first presented in the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) (Défossez et al., 2022). It is used in the [[VALL-E architecture]] to convert an audio signal into discrete codes. The paper presents the Encodec architecture in the following diagram:

![[Encodec model.png]]

The Encodec architecture, taken from the original paper.

The pipeline is the following:

1. The input audio signal is a sequence of samples at 24/48 kHz, 16 bits per sample.
2. The encoder uses 1D convolutions for downsampling and a two-layer LSTM for sequence modeling.
3. The encoder outputs 75/150 latent timesteps per second (for 24/48 kHz input, respectively), each with a depth dimension of 128.
4. The decoder mirrors the encoder, using transposed convolutions to upsample the latent sequence and reconstruct the audio waveform.
5. The quantizer uses [[residual vector quantization (RVQ)]] to map each latent timestep to 8 indices into 8 codebooks that represent the audio signal (see the RVQ sketch at the end of this note).
6. Mel spectrograms are computed from both the input audio and the generated audio.
7. These spectrograms are compared, and the difference is used as a reconstruction loss to drive training (see the loss sketch at the end of this note).
8. Several discriminators compare short-time Fourier transforms (STFTs) of the original and synthetic waveforms to compute a GAN loss that complements the mel-spectrogram loss.
9. A small Transformer language model can additionally be trained over the quantized codes to compress the bitstream further via entropy coding.

Another diagram of the Encodec model:

![[Encodec model-1.png]]

Taken from: [Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/abs/2301.02111)

## References

- [\[2210.13438\] High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- [High Fidelity Neural Audio Compression | Paper & Code Explained - YouTube](https://www.youtube.com/watch?v=mV7bhf6b2Hs)
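To make step 5 concrete, here is a minimal NumPy sketch of residual vector quantization: each stage picks the nearest codeword to the current residual and passes the remainder to the next stage, so decoding is just a sum of the selected codewords. The codebook sizes and shapes below are illustrative (1024-entry codebooks over a 128-dim latent, matching one Encodec latent timestep), not the paper's exact training procedure.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual left over by
    the previous stage, yielding one index per codebook."""
    residual = x.copy()
    indices = []
    for cb in codebooks:                      # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))           # nearest code to the residual
        indices.append(idx)
        residual = residual - cb[idx]         # pass the remainder onward
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is simply the sum of the selected codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

# Toy usage: 8 codebooks of 1024 entries over a 128-dim latent.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((1024, 128)) for _ in range(8)]
latent = rng.standard_normal(128)
codes = rvq_encode(latent, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(latent - approx))
```

With 1024-entry codebooks (10 bits each), 8 codebooks at 75 frames per second give 75 × 8 × 10 = 6,000 bits/s, i.e. the paper's 6 kbps operating point for 24 kHz audio.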
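For steps 6–7, the paper's spectral reconstruction loss compares mel spectrograms of the input and decoded audio at multiple window sizes, combining L1 and L2 terms. The sketch below follows that pattern with `torchaudio`; the specific scales, mel-bin count, and equal weighting are assumptions for illustration, not the paper's exact hyperparameters.

```python
import torch
import torchaudio

def multiscale_mel_loss(reference: torch.Tensor,
                        generated: torch.Tensor,
                        sample_rate: int = 24_000) -> torch.Tensor:
    """Compare mel spectrograms of reference vs. generated audio at a
    few representative window sizes, summing L1 and L2 terms."""
    loss = torch.tensor(0.0)
    for win in (512, 1024, 2048):             # assumed subset of scales
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=win,
            hop_length=win // 4, n_mels=64,
        )
        s_ref, s_gen = mel(reference), mel(generated)
        diff = s_ref - s_gen
        loss = loss + diff.abs().mean() + (diff ** 2).mean().sqrt()
    return loss

# Usage: one second of dummy audio and a slightly perturbed copy.
ref = torch.randn(1, 24_000)                  # (batch, samples)
gen = ref + 0.01 * torch.randn_like(ref)
print(multiscale_mel_loss(ref, gen))
```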
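Finally, a quick way to run the whole encode/decode pipeline is the Hugging Face `transformers` port of Encodec. This sketch assumes the `facebook/encodec_24khz` checkpoint and requests the 6 kbps bandwidth so that 8 codebooks are used, matching the description above.

```python
# pip install transformers torch
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# One second of dummy mono audio at 24 kHz.
audio = torch.randn(24_000).numpy()
inputs = processor(raw_audio=audio, sampling_rate=24_000,
                   return_tensors="pt")

# Encode to discrete codes at 6 kbps (8 codebooks); for one second of
# 24 kHz audio this yields 75 latent frames.
encoded = model.encode(inputs["input_values"], inputs["padding_mask"],
                       bandwidth=6.0)
print(encoded.audio_codes.shape)              # (..., 8 codebooks, 75 frames)

# Decode the codes back to a waveform.
decoded = model.decode(encoded.audio_codes, encoded.audio_scales,
                       inputs["padding_mask"])[0]
print(decoded.shape)
```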