GAN-based (Generative Adversarial Network-based) vocoders are a type of vocoder that utilizes a [[generative adversarial network (GAN)]] for generating high-quality speech waveforms. GANs consist of two neural networks, a generator and a discriminator, which are trained together competitively.
In the context of speech synthesis, GAN-based vocoders aim to generate realistic and natural-sounding speech waveforms by training the generator to produce speech waveforms that can deceive the discriminator. The discriminator, on the other hand, is trained to distinguish between real and generated speech waveforms.
The generator network in a GAN-based vocoder typically takes as input some form of low-level acoustic features, such as mel-spectrograms or vocoder parameters, and transforms them into speech waveforms. The generator learns to produce high-quality waveforms that resemble real speech by optimizing its parameters through an adversarial training process.
The discriminator network, often a binary classifier, is trained to differentiate between real speech waveforms and the synthesized waveforms generated by the generator. It learns to identify the characteristics of real speech and provide feedback to the generator, guiding it to produce a more realistic output.
During training, the generator and discriminator are trained iteratively. The generator aims to generate speech waveforms that fool the discriminator, while the discriminator aims to accurately discriminate between real and generated speech. This adversarial process encourages the generator to improve its synthesis quality, resulting in progressively more realistic speech waveforms.
Prominent examples of GAN-based vocoders include WaveGAN, Parallel WaveGAN, MelGAN, and [[HiFiGAN vocoder|HiFiGAN]].
Gan-based vocoders provide very fast parallel GPU and CPU speech synthesis with a quality close or matching [[WaveNet vocoder|WaveNet]]. They have the best quality vs latency trade-off. There are good open-source implementations such as the [[HiFiGAN vocoder|HiFiGAN]].