Global style tokens (GSTs) are a bank of embeddings, jointly trained within Tacotron in an unsupervised manner, that learn to model a wide range of acoustic expressiveness in speech. Because the tokens learn a disentangled latent representation of the high-dimensional audio, they yield soft, interpretable “labels” that can be used to control synthesis in novel ways, such as varying speed and speaking style independently of the text content. The labels can also be used for style transfer, replicating the speaking style of one “seed” phrase across an entire long-form text corpus.

![[gst-model-diagram.png]]

The GST model is based on the [[Tacotron architecture]]: a reference encoder summarizes the target audio into a fixed-length embedding, which acts as the query to an attention module over the token bank; the resulting style embedding conditions the Tacotron text encoder.

## Reference

Wang, Yuxuan, Daisy Stanton, Yu Zhang, R. J. Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A. Saurous. ‘Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis’. arXiv, 23 March 2018. [https://doi.org/10.48550/arXiv.1803.09017](https://doi.org/10.48550/arXiv.1803.09017).
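A minimal sketch of the style token layer, assuming PyTorch. The class name `StyleTokenLayer`, the token count, and the dimensions are illustrative, and the single-head scaled dot-product attention is a simplification (the paper uses multi-head attention over the token bank):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Attends a reference embedding over a learned bank of style tokens."""

    def __init__(self, num_tokens: int = 10, token_dim: int = 256, ref_dim: int = 128):
        super().__init__()
        # Token bank: trained jointly with the rest of the model, with no style labels.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.3)
        # Project the reference-encoder output into the token space for scoring.
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding: torch.Tensor) -> torch.Tensor:
        # ref_embedding: (batch, ref_dim), a fixed-length summary of the reference audio.
        query = self.query_proj(ref_embedding)           # (batch, token_dim)
        keys = torch.tanh(self.tokens)                   # (num_tokens, token_dim)
        scores = query @ keys.T / keys.shape[-1] ** 0.5  # (batch, num_tokens)
        weights = F.softmax(scores, dim=-1)              # soft "labels" over the tokens
        # Style embedding: a weighted combination of tokens that later
        # conditions the Tacotron text encoder states.
        return weights @ keys                            # (batch, token_dim)
```

At inference time the reference encoder can be bypassed: feeding a one-hot weight vector selects an individual token for direct style control, while feeding the embedding of a real “seed” clip performs style transfer.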