FastSpeech 2 is an enhanced version of FastSpeech, focusing on two main improvements.
Firstly, it trains directly on ground-truth mel-spectrograms instead of mel-spectrograms distilled from an autoregressive teacher model. This simplifies training by removing the two-stage teacher-student distillation pipeline and avoids the information loss that distilled targets suffer relative to the ground truth.
Secondly, FastSpeech 2 conditions the decoder on additional variance information such as duration, pitch (F0), and energy. This helps address the one-to-many mapping problem in text-to-speech synthesis: the same text can correspond to many valid speech realizations that differ in prosody.
FastSpeech 2 surpasses FastSpeech in terms of voice quality while retaining the advantages of fast, robust, and controllable speech synthesis offered by FastSpeech.
![[ren-fastspeech2-arch.png]]
[Ren et al (2021)](https://arxiv.org/pdf/2006.04558)
The variance adaptor incorporates variance information, namely duration, pitch, and energy, into the phoneme hidden sequence before it is passed to the decoder, which is how the model mitigates the one-to-many mapping problem. It consists of a duration predictor (also used in FastSpeech), a pitch predictor, and an energy predictor.
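The data flow through the adaptor can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pitch/energy embedding lookups are shown only as comments, and the length regulator (inherited from FastSpeech) is reduced to a repeat operation over integer frame durations.

```python
import torch


def length_regulate(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand each phoneme hidden vector by its duration in frames,
    as FastSpeech's length regulator does.

    hidden:    (seq_len, dim) phoneme hidden sequence
    durations: (seq_len,) integer frame counts per phoneme
    returns:   (sum(durations), dim) frame-level hidden sequence
    """
    return torch.repeat_interleave(hidden, durations, dim=0)


# Illustrative adaptor flow (embedding tables assumed defined elsewhere):
#   h = h + pitch_embedding(quantize(pitch))    # add pitch information
#   h = h + energy_embedding(quantize(energy))  # add energy information
#   h = length_regulate(h, durations)           # phoneme-level -> frame-level
```

At inference time the durations, pitch, and energy come from the predictors; at training time the ground-truth values are used instead.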
During training, ground-truth values of duration, pitch, and energy are used as inputs to predict the target speech. Simultaneously, the duration, pitch, and energy predictors are trained using the ground-truth values as targets. These predictors share a similar model structure comprising a 2-layer 1D-convolutional network with ReLU activation, layer normalization, dropout layer, and an additional linear layer for output projection.
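The shared predictor structure described above (2-layer 1D convolution with ReLU, layer normalization, dropout, and a linear output projection) can be sketched in PyTorch. The hyperparameter values here are illustrative defaults, not necessarily the paper's exact settings:

```python
import torch
import torch.nn as nn


class VariancePredictor(nn.Module):
    """Shared structure of the duration, pitch, and energy predictors:
    two 1D-conv layers with ReLU, layer norm, and dropout, followed by
    a linear projection to one scalar per phoneme."""

    def __init__(self, hidden_dim=256, filter_dim=256, kernel_size=3, dropout=0.5):
        super().__init__()
        pad = kernel_size // 2  # keep sequence length unchanged
        self.conv1 = nn.Conv1d(hidden_dim, filter_dim, kernel_size, padding=pad)
        self.ln1 = nn.LayerNorm(filter_dim)
        self.conv2 = nn.Conv1d(filter_dim, filter_dim, kernel_size, padding=pad)
        self.ln2 = nn.LayerNorm(filter_dim)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(filter_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim); Conv1d wants channels second
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(self.ln1(torch.relu(h)))
        h = self.conv2(h.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(self.ln2(torch.relu(h)))
        return self.proj(h).squeeze(-1)  # (batch, seq_len): one value per phoneme
```

Each of the three predictors is an instance of this module trained with a regression loss (e.g. MSE) against its ground-truth target.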
## References
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. In International Conference on Learning Representations, 2021. [PDF](https://arxiv.org/pdf/2006.04558)