Tacotron is a [[text-to-speech synthesis (TTS)]] system that takes characters as input and outputs a linear spectrogram; the Griffin-Lim algorithm is then used to reconstruct the speech waveform from that spectrogram. The backbone of Tacotron is an encoder-decoder model with attention.

![[tacotron-wang-2017.png]]
[Wang et al 2017](https://arxiv.org/abs/1703.10135)

The encoder takes a sequence of characters ($\mathbf{x}_{j}$) and produces a hidden representation ($\mathbf{h}_{j}$) for each position in the sequence:

$ \{\mathbf{h}_{j}\}_{j=1}^{L} = \text{Encoder}(\{\mathbf{x}_{j}\}_{j=1}^{L}) $

On the decoder side, the attention RNN takes its previous state ($\mathbf{s}_{i-1}$), the previous context vector ($\mathbf{c}_{i-1}$), and the last output frame ($\mathbf{y}_{i-1}$) to produce its new state ($\mathbf{s}_{i}$), which serves as the attention query:

$ \mathbf{s}_{i} = \text{RNN}_{Att}(\mathbf{s}_{i-1}, \mathbf{c}_{i-1}, \mathbf{y}_{i-1}) $

where $\mathbf{c}_{i}$ is the output of the attention block, a weighted sum of the encoder representations,

$ \mathbf{c}_{i} = \sum_{j} \alpha_{i,j} \mathbf{h}_{j} $

and the attention weights are computed from the query state and the encoder outputs:

$ \boldsymbol{\alpha}_{i} = \text{Attention}(\mathbf{s}_{i}, \{\mathbf{h}_{j}\}_{j=1}^{L}) $

The output of the decoder RNN, $\mathbf{d}_{i}$,

$ \mathbf{d}_{i} = \text{RNN}_{Dec}(\mathbf{d}_{i-1}, \mathbf{c}_{i}, \mathbf{s}_{i}) $

is used to compute the output frame:

$ \mathbf{y}_{i} = f_{0}(\mathbf{d}_{i}) $

(A minimal code sketch of this decoder step is given at the end of this note.)

![[alex-barron-tacotron.png]]
[Alex Barron 2022](http://web.stanford.edu/class/cs224s/lectures/224s.22.lec16.pdf)

The [[CBHG]] block (Convolutional filter Bank, Highway network, bidirectional GRU RNN) is a module for extracting representations from sequences, inspired by work in machine translation.

This version of Tacotron is commonly referred to as _Tacotron 1_ and was later extended to the [[Tacotron2 architecture]], which uses mel-spectrograms and a WaveNet vocoder.

## Reference

Wang, Yuxuan, R.J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, et al. ‘Tacotron: Towards End-to-End Speech Synthesis’. In _Interspeech 2017_, 4006–10. ISCA, 2017. [https://doi.org/10.21437/Interspeech.2017-1452](https://doi.org/10.21437/Interspeech.2017-1452).
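To make the decoding loop above concrete, here is a minimal PyTorch sketch of a single decoder step following the equations. It is an illustration under simplifying assumptions, not the paper's exact architecture: the class name `TacotronDecoderStep`, the dimensions, and the dot-product attention score are assumptions (the paper uses a prenet on the previous frame, additive content-based attention, and predicts several frames per step).

```python
import torch
import torch.nn as nn


class TacotronDecoderStep(nn.Module):
    """One decoder step: attention RNN -> attention weights -> context -> decoder RNN -> frame.
    Dimensions and the dot-product score are illustrative assumptions, not the paper's exact setup."""

    def __init__(self, enc_dim: int = 256, hid_dim: int = 256, frame_dim: int = 80):
        super().__init__()
        # s_i = RNN_Att(s_{i-1}, c_{i-1}, y_{i-1}): previous context and frame are fed as input.
        self.attn_rnn = nn.GRUCell(enc_dim + frame_dim, hid_dim)
        # Projects the query state s_i into the encoder space for the attention score.
        self.query_proj = nn.Linear(hid_dim, enc_dim, bias=False)
        # d_i = RNN_Dec(d_{i-1}, c_i, s_i)
        self.dec_rnn = nn.GRUCell(enc_dim + hid_dim, hid_dim)
        # y_i = f(d_i): linear projection to one output frame.
        self.frame_proj = nn.Linear(hid_dim, frame_dim)

    def forward(self, h, s_prev, d_prev, c_prev, y_prev):
        # Attention RNN state (the query): s_i.
        s = self.attn_rnn(torch.cat([c_prev, y_prev], dim=-1), s_prev)
        # alpha_i: softmax over encoder positions of a (simplified) dot-product score(s_i, h_j).
        scores = torch.bmm(h, self.query_proj(s).unsqueeze(-1)).squeeze(-1)  # (B, L)
        alpha = torch.softmax(scores, dim=-1)
        # c_i = sum_j alpha_{i,j} h_j.
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)                      # (B, enc_dim)
        # d_i and the output frame y_i.
        d = self.dec_rnn(torch.cat([c, s], dim=-1), d_prev)
        y = self.frame_proj(d)
        return y, s, d, c, alpha


if __name__ == "__main__":
    # Tiny usage example with random tensors standing in for the encoder outputs {h_j}.
    B, L, enc_dim, hid_dim, frame_dim = 2, 13, 256, 256, 80
    step = TacotronDecoderStep(enc_dim, hid_dim, frame_dim)
    h = torch.randn(B, L, enc_dim)          # encoder outputs
    s = torch.zeros(B, hid_dim)             # s_0
    d = torch.zeros(B, hid_dim)             # d_0
    c = torch.zeros(B, enc_dim)             # c_0
    y = torch.zeros(B, frame_dim)           # y_0 (the all-zero <GO> frame)
    for _ in range(5):                      # unroll a few decoder steps
        y, s, d, c, alpha = step(h, s, d, c, y)
    print(y.shape, alpha.shape)             # torch.Size([2, 80]) torch.Size([2, 13])
```

In the full model the predicted frames would be passed through the CBHG post-processing network and Griffin-Lim to obtain the waveform; this sketch only covers the attention/decoder recurrence.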