**WaveNet** is a generative model of audio that can generate high-quality speech from [[acoustic features]] such as an [[intermediate spectrogram]]. It was developed by researchers at Google DeepMind and first introduced in 2016. The key innovation of WaveNet is its ability to model the conditional probability distribution of the next audio sample given the sequence of previous samples. Unlike traditional text-to-speech systems that generate speech from predefined acoustic features, WaveNet directly models the raw audio waveform, which lets it capture fine details and produce more natural-sounding speech.

WaveNet is an autoregressive model that generates one audio sample at a time. The original model produces audio as a sequence of 8-bit samples (μ-law amplitude encoding) $\mathbf{y} = (y_{1},\dots,y_{T})$ conditioned on an [[intermediate spectrogram]] $\mathbf{h} = (h_{1},\dots,h_{T})$. The probability of a waveform is

$ p(\mathbf{y}) = \prod_{t=1}^{T} P(y_{t}\mid y_{1},\dots,y_{t-1},h_{1},\dots,h_{t}) $

This probability distribution is modeled by a stack of convolutional layers. To handle the long-range temporal dependencies needed for raw audio generation, WaveNet uses dilated convolutions to achieve a very large receptive field. Dilated convolutions are a type of causal convolution layer. Causal convolution layers only look at present and past inputs, which makes them well suited to autoregressive processing, where the prediction of the next output relies only on previous inputs. In dilated convolutions, the filter spans a larger range by skipping input values, allowing it to capture more context. For instance, with a dilation value of 2, a length-2 convolutional filter at time $t$ would consider the input values $x_t$ and $x_{t-2}$. The next figure shows the computation of the output at time $t$ with 4 dilated convolution layers with dilation values 1, 2, 4 and 8.

![[wavenet-layers.png]]
[Oord et al. 2016](https://arxiv.org/pdf/1609.03499)

For example, the WaveNet model in the [[Tacotron2 architecture|Tacotron2]] has a total of 12 convolutional layers organized into two cycles with a dilation cycle size of 6: the first 6 layers have dilations of 1, 2, 4, 8, 16, and 32, and the next 6 layers again have dilations of 1, 2, 4, 8, 16, and 32. Since the output audio samples are encoded in 8 bits, the output of the dilated convolution stack is fed to a softmax that classifies each sample into one of 256 levels.

![[heiga-zen-wavenet-arch.png]]
[Heiga Zen 2017](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45882.pdf)

The WaveNet model is trained separately from the spectrogram predictor. After training, the spectrogram predictor is run in teacher-forcing mode, with each predicted spectral frame conditioned on the encoded text input and the previous frame of the ground-truth spectrogram. This produces a sequence of ground-truth-aligned spectral features paired with gold audio output, which is used to train the vocoder.

The original WaveNet model is extremely slow, and many kinds of improvements have been added to it. An important one is the parallelization of the model, which avoids the latency of having to wait to generate each sample until the previous one has been generated (autoregressive generation).
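To make the 8-bit μ-law encoding mentioned above concrete, here is a minimal sketch of the companding step with μ = 255, i.e. 256 output levels, assuming NumPy; the function names are illustrative and not taken from any particular library.

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Map waveform samples in [-1, 1] to 256 discrete levels (0..255)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compress amplitude
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)          # quantize to 8 bits

def mulaw_decode(q, mu=255):
    """Invert the quantization back to a waveform in [-1, 1]."""
    y = 2 * (q.astype(np.float64) / mu) - 1                   # back to [-1, 1]
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu      # expand amplitude

print(mulaw_encode(np.linspace(-1, 1, 5)))   # [  0  16 128 239 255]
```

The logarithmic compression allocates more quantization levels to quiet amplitudes, which is why 8 bits per sample are enough for intelligible speech.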
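The following is a minimal sketch, not the DeepMind implementation, of a stack of dilated causal convolutions with a 256-way softmax output, written in PyTorch. The channel width, the plain residual connections, and the ReLU activations are simplifying assumptions; the real WaveNet uses gated activation units together with residual and skip connections. The dilation pattern follows the two cycles of 1–32 described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that looks only at present and past samples."""
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation          # pad on the left only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                                     # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

class DilatedStack(nn.Module):
    """Two cycles of dilations 1..32 (12 layers) with a 256-way output."""
    def __init__(self, channels=64, n_classes=256):
        super().__init__()
        dilations = [1, 2, 4, 8, 16, 32] * 2
        self.input_proj = CausalConv1d(1, channels)
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, channels, dilation=d) for d in dilations]
        )
        self.output_proj = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):                                     # x: (batch, 1, time)
        h = torch.relu(self.input_proj(x))
        for layer in self.layers:
            h = torch.relu(layer(h)) + h                      # simplified residual connection
        return self.output_proj(h)                            # logits over 256 mu-law levels

logits = DilatedStack()(torch.randn(1, 1, 16000))             # shape (1, 256, 16000)
probs = F.softmax(logits, dim=1)                              # per-sample distribution
```

With kernel size 2, the 12 dilated layers alone cover 1 + 2·(1+2+4+8+16+32) = 127 input samples, which illustrates how dilation makes the receptive field grow exponentially with depth while the number of parameters grows only linearly.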
The most efficient current models are the [[GAN-based vocoders]] such as the [[HiFiGAN vocoder]].

## Reference

Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. ‘WaveNet: A Generative Model for Raw Audio’. arXiv, 19 September 2016. [https://doi.org/10.48550/arXiv.1609.03499](https://doi.org/10.48550/arXiv.1609.03499).