FastSpeech is a proposed solution that addresses several issues in [[neural speech synthesis]].
Firstly, it utilizes a feed-forward Transformer network to generate mel-spectrograms in parallel, significantly improving the speed of inference.
Secondly, it eliminates the attention mechanism between text and speech, which helps prevent problems like word skipping and repetition, resulting in improved robustness. Instead, FastSpeech employs a length regulator that leverages a duration predictor to predict the duration of each phoneme. This regulator expands the phoneme hidden sequence to match the length of the mel-spectrogram sequence, enabling parallel generation.
FastSpeech offers several benefits, including exceptionally fast inference speed, robust synthesis without skipping or repeating words, and comparable voice quality to previous autoregressive models.
FastSpeech resembles earlier parametric TTS systems in explicitly predicting phoneme durations (its successor, [[FastSpeech2 architecture|FastSpeech 2]], additionally predicts energy and pitch). Durations for training can be computed with an autoregressive teacher model such as [[Tacotron architecture|Tacotron]], or with traditional HMM-based forced alignment.
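One common recipe for extracting training durations from a teacher model (the route taken by Ren et al.) is to read them off the teacher's decoder-to-encoder attention: assign each mel frame to its most-attended phoneme and count frames per phoneme. A simplified sketch, assuming a single attention matrix (the paper additionally selects the most diagonal head via a "focus rate"; the toy matrix below is made up for illustration):

```python
import numpy as np

def durations_from_attention(attn):
    """attn: (n_mel_frames, n_phonemes) teacher attention weights."""
    best = attn.argmax(axis=1)                    # phoneme index per mel frame
    return np.bincount(best, minlength=attn.shape[1])

# hypothetical alignment: 5 mel frames over 3 phonemes
attn = np.array([[0.9, 0.1, 0.0],
                 [0.7, 0.3, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.2, 0.8],
                 [0.0, 0.1, 0.9]])
durations_from_attention(attn)   # -> array([2, 1, 2]); sums to the 5 frames
```

By construction the durations always sum to the number of mel frames, which is exactly the invariant the length regulator needs.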
![[ren-fastspeech1-arch.png]]
[Ren et al 2019](https://arxiv.org/pdf/1905.09263.pdf)
In the FastSpeech architecture, the encoder and decoder are each a stack of N fully parallel feed-forward Transformer (FFT) blocks.
Each block consists of a self-attention network and a 1D convolutional network. The self-attention network uses multi-head attention to extract cross-position information. Unlike the 2-layer dense network in the original Transformer, FastSpeech uses a 2-layer 1D convolutional network with ReLU activation. Following the Transformer, residual connections, layer normalization, and dropout are applied after both the self-attention network and the 1D convolutional network.
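The structure of one FFT block can be sketched in plain numpy. This is a minimal illustration, not the paper's implementation: it uses single-head attention where FastSpeech uses multi-head, and it omits dropout and training machinery; all weight names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 8, 3, 5   # hidden size, conv kernel width, sequence length (toy values)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product attention (the paper uses multi-head)
    q, kk, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ kk.T / np.sqrt(q.shape[-1])) @ v

def conv1d(x, W, b):
    # x: (T, d_in), W: (k, d_in, d_out), 'same' padding along time
    pad = W.shape[0] // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + W.shape[0], :, None] * W).sum(axis=(0, 1)) + b
                     for t in range(x.shape[0])])

def fft_block(x, p):
    # self-attention sublayer with residual connection + layer norm
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # 2-layer 1D convolutional sublayer with ReLU, residual + layer norm
    h = np.maximum(conv1d(x, p["W1"], p["b1"]), 0.0)
    return layer_norm(x + conv1d(h, p["W2"], p["b2"]))

p = {n: rng.normal(scale=0.1, size=s) for n, s in {
    "Wq": (d, d), "Wk": (d, d), "Wv": (d, d),
    "W1": (k, d, 4 * d), "b1": (4 * d,),
    "W2": (k, 4 * d, d), "b2": (d,)}.items()}

x = rng.normal(size=(T, d))
y = fft_block(x, p)   # same shape as the input: (T, d)
```

Because every operation here is position-parallel (attention, convolution, layer norm), the whole stack runs over the full sequence at once, which is what makes parallel mel-spectrogram generation possible.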
Between the encoder and decoder, a length regulator bridges the length gap between the phoneme and mel-spectrogram sequences. Based on the predicted phoneme durations, the length regulator repeats each encoder state so that the total length of the hidden sequence matches the length of the mel-spectrogram sequence.
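The repeat operation itself is simple; a minimal sketch follows, with made-up durations (in practice they come from the duration predictor at inference, or from the teacher alignment during training):

```python
import numpy as np

def length_regulator(hidden, durations):
    """Expand phoneme hidden states to mel-frame length.

    hidden:    (n_phonemes, d) encoder outputs
    durations: integer mel-frame count per phoneme
    """
    return np.repeat(hidden, durations, axis=0)

h = np.arange(6.0).reshape(3, 2)           # 3 phonemes, hidden size 2
mel_states = length_regulator(h, [2, 1, 3])
# mel_states.shape == (6, 2): 2 + 1 + 3 mel frames
```

Scaling the durations by a constant before rounding is what gives FastSpeech its voice-speed control: multiplying all durations by 1.3, say, slows the speech down without retraining.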
The FastSpeech architecture was further improved in the [[FastSpeech2 architecture]].
## Reference
Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, Robust and Controllable Text to Speech. In NeurIPS, 2019. [PDF](https://arxiv.org/pdf/1905.09263)