**HiFi-GAN** is a generative adversarial network for speech synthesis. It consists of one generator and two discriminators: a multi-scale discriminator and a multi-period discriminator. The generator and the discriminators are trained adversarially, together with two additional losses that improve training stability and model performance.

The generator is a fully convolutional neural network. It takes a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of the raw waveform. Each [transposed convolution](https://paperswithcode.com/method/transposed-convolution) is followed by a multi-receptive field fusion (MRF) module, which sums the outputs of residual blocks with different kernel sizes and dilation rates so that patterns of varying lengths are observed in parallel.

On the discriminator side, the multi-period discriminator (MPD) consists of several sub-discriminators, each handling a portion of the periodic signals in the input audio by folding the waveform at a fixed period. In addition, to capture consecutive patterns and long-term dependencies, the multi-scale discriminator (MSD) proposed in [MelGAN](https://paperswithcode.com/method/melgan) is used; it evaluates the audio consecutively at different resolutions (raw and average-pooled). Minimal code sketches of both pieces appear at the end of this note.

![[hifi-gan-vocoder.png]]

[Kong et al. (2020)](https://arxiv.org/pdf/2010.05646)

![[nvidia-hifigan.png]]

[NVIDIA implementation in PyTorch](https://pytorch.org/hub/nvidia_deeplearningexamples_hifigan/)

## Reference

Kong, Jungil, Jaehyeon Kim, and Jaekyoung Bae. ‘HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis’. arXiv, 23 October 2020. [https://doi.org/10.48550/arXiv.2010.05646](https://doi.org/10.48550/arXiv.2010.05646).
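## Code sketches

The architecture above maps onto a small amount of code. Below is a minimal PyTorch sketch of the generator's upsample-then-MRF pattern, assuming V1-style hyperparameters from the paper (80 mel bins, 512 initial channels, upsample rates 8, 8, 2, 2 for a 256× hop size, MRF kernel sizes 3/7/11 with dilations 1/3/5). The class names are illustrative; this is not the official implementation, and it omits details such as weight normalization and the paper's exact residual block layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    """One MRF branch: dilated 1-D convolutions with residual connections.
    Kernel size and dilations set this branch's receptive field."""
    def __init__(self, channels, kernel_size, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=(kernel_size - 1) * d // 2)
            for d in dilations
        )

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))
        return x


class MRF(nn.Module):
    """Multi-receptive field fusion: parallel residual blocks with different
    kernel sizes, averaged so patterns of several lengths are fused."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.blocks = nn.ModuleList(ResBlock(channels, k) for k in kernel_sizes)

    def forward(self, x):
        return sum(block(x) for block in self.blocks) / len(self.blocks)


class Generator(nn.Module):
    """Mel-spectrogram -> raw waveform. Each transposed convolution raises
    the temporal resolution by its stride and is followed by an MRF module;
    the strides (8, 8, 2, 2) multiply to a 256x total upsampling factor."""
    def __init__(self, n_mels=80, channels=512, upsample_rates=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, channels, 7, padding=3)
        self.ups, self.mrfs = nn.ModuleList(), nn.ModuleList()
        for r in upsample_rates:
            # kernel 2r, stride r, padding r//2 upsamples by exactly r
            self.ups.append(nn.ConvTranspose1d(channels, channels // 2,
                                               kernel_size=2 * r,
                                               stride=r, padding=r // 2))
            channels //= 2
            self.mrfs.append(MRF(channels))
        self.post = nn.Conv1d(channels, 1, 7, padding=3)

    def forward(self, mel):                        # (batch, n_mels, frames)
        x = self.pre(mel)
        for up, mrf in zip(self.ups, self.mrfs):
            x = mrf(up(F.leaky_relu(x, 0.1)))
        return torch.tanh(self.post(x))            # (batch, 1, frames * 256)


gen = Generator()
print(gen(torch.randn(1, 80, 100)).shape)          # torch.Size([1, 1, 25600])
```

The MPD side hinges on one reshaping trick: each sub-discriminator picks a period p (the paper uses 2, 3, 5, 7, and 11) and folds the 1-D waveform into a 2-D grid of width p, so its 2-D convolutions only relate samples that are p steps apart. Below is a sketch of that preprocessing with a hypothetical helper `fold_by_period`; the convolutional stack of each sub-discriminator is omitted.

```python
import torch
import torch.nn.functional as F


def fold_by_period(wav: torch.Tensor, period: int) -> torch.Tensor:
    """Reflect-pad a batch of waveforms (batch, 1, T) so T is a multiple
    of `period`, then view it as (batch, 1, T // period, period) for the
    corresponding MPD sub-discriminator."""
    t = wav.shape[-1]
    if t % period:
        wav = F.pad(wav, (0, period - t % period), mode="reflect")
        t = wav.shape[-1]
    return wav.view(wav.shape[0], wav.shape[1], t // period, period)


print(fold_by_period(torch.randn(1, 1, 25600), 3).shape)  # (1, 1, 8534, 3)
```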