Deep Voice 3 is a fully convolutional, attention-based neural text-to-speech system. The architecture of Deep Voice 3 consists of three main components:

- Encoder: A fully convolutional encoder that transforms textual features into an internal learned representation.
- Decoder: A fully convolutional causal decoder that uses a multi-hop convolutional attention mechanism to convert the learned representation into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.
- Converter: A fully convolutional post-processing network that predicts the final vocoder parameters (depending on the vocoder choice) from the decoder's hidden states. Unlike the decoder, the converter is non-causal and can therefore use future context.

![[deep-voice-3-architecture.png.png]]

The overall objective function is a linear combination of the decoder and converter losses. The authors separate the decoder and converter and train them jointly in a multi-task fashion because this makes the attention mechanism easier to learn in practice: the mel-spectrogram prediction loss guides attention training, since attention receives gradients from both mel-spectrogram prediction and vocoder parameter prediction.

Reference:
- [\[1710.07654v3\] Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](http://arxiv.org/abs/1710.07654v3)
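To make the causal-decoder vs. non-causal-converter distinction concrete, here is a minimal NumPy sketch (not the paper's implementation; function names and padding scheme are illustrative assumptions). A causal convolution pads only on the left, so each output position depends only on current and past inputs, which the autoregressive decoder requires; a non-causal convolution pads symmetrically, so outputs can also draw on future context, as the converter does:

```python
import numpy as np

def conv1d(x, w, causal):
    """1-D convolution over sequence x with kernel w.
    Causal (decoder-style): pad only on the left, so output t sees
    inputs <= t. Non-causal (converter-style): pad symmetrically,
    so output t can also see future inputs."""
    k = len(w)
    if causal:
        pad = (k - 1, 0)               # all padding on the left
    else:
        pad = ((k - 1) // 2, k // 2)   # centered padding
    xp = np.pad(x, pad)
    return np.array([xp[t:t + k] @ w for t in range(len(x))])

# Impulse input reveals which output positions each scheme can "see" from.
x = np.zeros(7)
x[3] = 1.0
w = np.ones(3)

causal_out = conv1d(x, w, causal=True)       # impulse affects t >= 3 only
noncausal_out = conv1d(x, w, causal=False)   # impulse also affects t = 2

# The overall training objective is then a linear combination of the
# decoder and converter losses, schematically:
#   total_loss = w_dec * decoder_loss + w_conv * converter_loss
```

Stacking such causal layers keeps the decoder autoregressive end to end, while the converter's non-causal layers let vocoder-parameter prediction exploit the full utterance.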