The Tacotron 2 architecture extends the earlier [[Tacotron architecture]].
It replaces the [[CBHG]] stacks and [[gated recurrent unit (GRU)]] layers with plain convolutional layers and [[long short-term memory (LSTM)]] layers in both the encoder and decoder.
It also drops the “reduction factor”, so each decoder step corresponds to a single spectrogram frame, and it uses location-sensitive attention instead of additive attention.
The Tacotron 2 encoder maps every input grapheme to a 512-dimensional character embedding. To capture the broader letter context, a stack of three convolutional layers is applied, each with 512 filters of shape 5 × 1, so that each filter spans 5 characters. The output of this stack is fed into a bidirectional Long Short-Term Memory (biLSTM) layer, which produces the final encoder representation.
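A minimal PyTorch sketch of this encoder, assuming the layer sizes quoted above (512-dim embeddings, three 5 × 1 convolutions, a biLSTM with 256 units per direction); class and variable names are illustrative, and regularization details such as dropout are omitted for brevity:

```python
import torch
import torch.nn as nn

class Tacotron2Encoder(nn.Module):
    """Sketch: character embedding -> three 1-D convolutions -> biLSTM."""
    def __init__(self, n_symbols, emb_dim=512, kernel_size=5):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size,
                          padding=kernel_size // 2),  # preserve sequence length
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        # Bidirectional LSTM: 256 units per direction -> 512-dim outputs
        self.lstm = nn.LSTM(emb_dim, emb_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, char_ids):                      # (batch, seq_len)
        x = self.embedding(char_ids).transpose(1, 2)  # (batch, 512, seq_len)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                         # (batch, seq_len, 512)
        outputs, _ = self.lstm(x)                     # (batch, seq_len, 512)
        return outputs
```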
The encoder output is fed to a [[location-sensitive attention]] block.
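The idea behind location-sensitive attention is to extend additive (Bahdanau) scoring with features computed by convolving the attention weights from the previous step, so the model keeps track of where it has already attended. A simplified sketch follows; the dimensions and the use of a single channel of previous weights are assumptions for illustration, not figures from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Sketch: additive attention augmented with location features."""
    def __init__(self, query_dim, memory_dim, attn_dim=128,
                 n_location_filters=32, location_kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, n_location_filters, location_kernel,
                                       padding=location_kernel // 2, bias=False)
        self.location_layer = nn.Linear(n_location_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, prev_weights):
        # query: (batch, query_dim) decoder state
        # memory: (batch, T, memory_dim) encoder outputs
        # prev_weights: (batch, T) attention weights from the previous step
        loc = self.location_conv(prev_weights.unsqueeze(1))  # (batch, F, T)
        loc = self.location_layer(loc.transpose(1, 2))       # (batch, T, attn_dim)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)
            + self.memory_layer(memory)
            + loc)).squeeze(-1)                              # (batch, T)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        return context, weights
```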
At each decoding step, the mel spectrogram frame predicted at the previous step is passed through a small pre-net that acts as an information bottleneck. The pre-net output is concatenated with the attention context vector computed over the encoder outputs and passed through two LSTM layers. The resulting LSTM output serves two purposes. First, it is linearly projected to predict a single frame of an 80-dimensional log-mel filterbank vector (a 50 ms window with a 12.5 ms stride). Second, it is passed through another linear layer followed by a sigmoid to predict a 'stop token' that signals the end of output generation.
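The following sketch shows one such decoder step, assuming the shapes described above (80 mel channels; the pre-net, LSTM, and context sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Tacotron2DecoderStep(nn.Module):
    """Sketch: pre-net bottleneck -> two LSTM cells -> mel and stop heads."""
    def __init__(self, n_mels=80, prenet_dim=256, context_dim=512,
                 lstm_dim=1024):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_dim), nn.ReLU(),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(),
        )
        self.lstm1 = nn.LSTMCell(prenet_dim + context_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)
        self.mel_proj = nn.Linear(lstm_dim + context_dim, n_mels)
        self.stop_proj = nn.Linear(lstm_dim + context_dim, 1)

    def forward(self, prev_mel, context, states):
        (h1, c1), (h2, c2) = states
        # Bottlenecked previous frame, concatenated with attention context
        x = torch.cat([self.prenet(prev_mel), context], dim=-1)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        out = torch.cat([h2, context], dim=-1)
        mel_frame = self.mel_proj(out)                  # one 80-dim mel frame
        stop_prob = torch.sigmoid(self.stop_proj(out))  # end-of-output flag
        return mel_frame, stop_prob, ((h1, c1), (h2, c2))
```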
Training uses ground-truth ("gold") log-mel filterbank features through a technique called teacher forcing: at each step, the decoder is fed the true log-mel spectral frame rather than its own prediction from the previous step.
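A minimal sketch of that teacher-forced decoding loop, reusing the illustrative `Tacotron2DecoderStep` above; for brevity the attention contexts are assumed to be precomputed per step, whereas in the real model the context is recomputed from the decoder state at every step:

```python
import torch

def decode_with_teacher_forcing(decoder_step, gold_mels, contexts, states):
    # gold_mels: (batch, T, 80) ground-truth log-mel frames
    # contexts:  (batch, T, context_dim) attention contexts (simplified here)
    batch, T, n_mels = gold_mels.shape
    prev = gold_mels.new_zeros(batch, n_mels)  # all-zero <GO> frame
    mel_outputs, stop_outputs = [], []
    for t in range(T):
        mel, stop, states = decoder_step(prev, contexts[:, t], states)
        mel_outputs.append(mel)
        stop_outputs.append(stop)
        prev = gold_mels[:, t]                 # teacher forcing: feed gold frame
    return torch.stack(mel_outputs, 1), torch.stack(stop_outputs, 1), states
```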
![[tacotron2-block-diagram.png]]
[Vainer et al. 2020](https://doi.org/10.48550/arXiv.2008.03802)
The Tacotron 2 architecture also includes a modified [[WaveNet vocoder]], which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.
Reference:
Vainer, Jan, and Ondřej Dušek. ‘SpeedySpeech: Efficient Neural Speech Synthesis’. arXiv, 9 August 2020. [https://doi.org/10.48550/arXiv.2008.03802](https://doi.org/10.48550/arXiv.2008.03802).