Flow-based models for speech synthesis are based on the concept of [[normalizing flows]]: a sequence of invertible transformations applied to a simple distribution, such as a Gaussian, to obtain a more complex distribution.
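Concretely, for an invertible map $f$ taking data $x$ to a latent $z = f(x)$ under a simple prior $p_Z$, the change-of-variables formula gives the exact data likelihood; a flow composes many such maps, and their log-determinants simply add:

$$
\log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
$$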
Flow-based TTS models typically combine transformer backbones with learned durations or alignments. This way they capture the fast parallel inference and high quality of [[FastSpeech architecture|FastSpeech]] together with the flexibility of [[Tacotron architecture|Tacotron]], which needs no external alignments.
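As an illustration, the duration-based upsampling that enables parallel decoding fits in a few lines. This is a minimal sketch of the length-regulation idea; the function and variable names are hypothetical, not taken from any specific codebase:

```python
import torch

def regulate_length(encoder_states: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand each text-side hidden state by its predicted duration.

    encoder_states: (T_text, D) one hidden vector per input token.
    durations:      (T_text,) integer number of output frames per token.
    Returns:        (T_mel, D) upsampled states, with T_mel = durations.sum().
    """
    return torch.repeat_interleave(encoder_states, durations, dim=0)

# Toy usage: 3 tokens with durations 2/1/3 yield 6 decoder frames.
states = torch.randn(3, 4)
durations = torch.tensor([2, 1, 3])
print(regulate_length(states, durations).shape)  # torch.Size([6, 4])
```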
In the context of speech synthesis, flow-based models learn an invertible mapping between a [[latent space]] with a simple distribution and the space of acoustic features or speech waveforms; because the mapping must be bijective, the latent space has the same dimensionality as the data it models. This mapping allows the model to generate speech by sampling from the latent space and passing the samples through the inverse flow transformations.
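A minimal sketch of the invertible building block that most flow-based models stack, an affine coupling layer in the style of RealNVP/Glow; class and variable names here are illustrative, not from any particular implementation:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible flow step: split the features, then scale and shift
    one half conditioned on the other (RealNVP/Glow-style coupling)."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        # Small network predicting log-scale and shift for the second half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        """Data -> latent; also returns log|det J| for the likelihood."""
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(xa).chunk(2, dim=-1)
        zb = xb * torch.exp(log_s) + t
        return torch.cat([xa, zb], dim=-1), log_s.sum(dim=-1)

    def inverse(self, z):
        """Latent -> data; the exact inverse of forward."""
        za, zb = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(za).chunk(2, dim=-1)
        xb = (zb - t) * torch.exp(-log_s)
        return torch.cat([za, xb], dim=-1)

# Invertibility check on random features.
flow = AffineCoupling(dim=8)
x = torch.randn(4, 8)
z, logdet = flow(x)
print(torch.allclose(flow.inverse(z), x, atol=1e-5))  # True
```

The coupling design is what makes flows practical: both the inverse and the Jacobian log-determinant are cheap to compute, so exact likelihood training stays tractable.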
The main advantage of flow-based models for speech synthesis is their ability to generate high-quality, natural-sounding speech. Because they model the data distribution directly with an exact, tractable likelihood, they can capture the complex dependencies and structure present in speech waveforms. Additionally, flow-based models offer the flexibility to control various aspects of the generated speech, such as prosody, style, or speaker characteristics, by manipulating the latent space.
Training flow-based models for speech synthesis typically involves maximizing the exact likelihood of the training data. The model learns the parameters of the flow transformations (and, where applicable, of the latent prior) through iterative gradient-based optimization. Once trained, the model generates new speech waveforms by sampling from the latent prior and applying the inverse flow transformations.
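A toy maximum-likelihood training step under stated assumptions: a single elementwise affine flow and a fixed standard-Gaussian prior (real systems stack many flow steps and condition on text); all names are hypothetical:

```python
import math
import torch
import torch.nn as nn

class ElementwiseAffineFlow(nn.Module):
    """Simplest invertible flow: z = x * exp(log_s) + t."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Data -> latent, plus the log|det J| of the transformation.
        return x * torch.exp(self.log_s) + self.t, self.log_s.sum()

    def inverse(self, z):
        # Latent -> data: the exact inverse used at generation time.
        return (z - self.t) * torch.exp(-self.log_s)

dim = 2
flow = ElementwiseAffineFlow(dim)
opt = torch.optim.Adam(flow.parameters(), lr=1e-2)
data = torch.randn(512, dim) * 3.0 + 1.0   # stand-in for acoustic features

for step in range(200):
    z, log_det = flow(data)
    # Exact log-likelihood via the change-of-variables formula.
    log_pz = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * dim * math.log(2 * math.pi)
    loss = -(log_pz + log_det).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generation: sample the prior, apply the inverse flow. Scaling the noise
# (a "temperature" below 1) trades sample diversity for stability.
samples = flow.inverse(0.8 * torch.randn(8, dim))
```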
Examples of flow-based TTS models include [[GlowTTS architecture]], [[VITS architecture]], and [[YourTTS architecture]].