![[zero-shot-tts.png]]
Speaker adaptation and speaker encoding approaches for voice cloning ([Arik2019](https://arxiv.org/abs/1802.06006))
Voice cloning is the process of creating a synthetic voice that mimics the speech patterns, intonation, and other characteristics of a specific human voice. This is typically achieved using machine learning and speech synthesis techniques that analyze a recording of the target voice and generate a synthetic version that sounds similar.
Zero-shot voice cloning is a technique for creating a synthetic voice without training or fine-tuning on data from the target speaker. Instead, it relies on a generative model pre-trained on speech from many speakers, which learns to generalize to voices it has never seen.
To create a zero-shot voice clone, the generative model is conditioned on a representation of the target voice, typically a fixed-size speaker embedding extracted by a speaker encoder from a few seconds of reference audio. The model then generates speech whose timbre and speaking style match this embedding, resulting in a synthetic voice that is similar to the target speaker's, as in the sketch below.
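A minimal PyTorch sketch of this conditioning idea, with made-up module names and dimensions (it mirrors no specific paper's architecture): the per-utterance speaker embedding is broadcast over the text-encoder timesteps and concatenated to each frame, so every decoding step is informed by the target voice.

```python
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    """Toy illustration: project a per-utterance speaker embedding and
    concatenate it to every frame of the text encoder's output before
    the result is passed on to the decoder."""
    def __init__(self, text_dim=256, spk_dim=256, out_dim=512):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, text_dim)
        self.out_proj = nn.Linear(2 * text_dim, out_dim)

    def forward(self, text_hidden, spk_embed):
        # text_hidden: (batch, T, text_dim) frames from the text encoder
        # spk_embed:   (batch, spk_dim), one embedding per utterance
        spk = self.spk_proj(spk_embed).unsqueeze(1)    # (batch, 1, text_dim)
        spk = spk.expand(-1, text_hidden.size(1), -1)  # repeat over time
        return self.out_proj(torch.cat([text_hidden, spk], dim=-1))
```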
Zero-shot voice cloning has several advantages over traditional voice cloning techniques, which require large amounts of training data from the target speaker. For example, it can be used to create synthetic voices for speakers who are no longer alive or for whom only limited recordings exist. It also makes personalized synthetic voices easier to produce, since a user only needs to supply a short reference recording of the desired voice.
Zero-shot voice cloning is also called zero-shot multispeaker TTS (ZS-TTS). The technique was first proposed in the paper "Neural Voice Cloning with a Few Samples" by a team of researchers from Baidu Research ([Arik2019](https://arxiv.org/abs/1802.06006)), who extended the Deep Voice 3 system with the speaker adaptation and speaker encoding approaches shown above. Later, Tacotron 2 was adapted to condition on external speaker embeddings extracted by a speaker encoder trained with the generalized end-to-end (GE2E) loss, so that the synthesized speech resembles the target speaker (see the loss sketch below).
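The GE2E loss pulls each utterance embedding toward the centroid of its own speaker's embeddings and pushes it away from every other speaker's centroid in the batch. A self-contained sketch of its softmax variant, with the similarity scale `w` and bias `b` kept as fixed constants here rather than learned parameters as in the original paper:

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeds: torch.Tensor, w: float = 10.0, b: float = -5.0):
    """GE2E softmax loss sketch. embeds: (N speakers, M utterances, D)."""
    N, M, _ = embeds.shape
    embeds = F.normalize(embeds, dim=-1)

    # Speaker centroids; for an utterance's own speaker, leave that utterance
    # out of the mean (the "exclusive" centroid from the GE2E paper).
    centroids = F.normalize(embeds.mean(dim=1), dim=-1)           # (N, D)
    excl = (embeds.sum(dim=1, keepdim=True) - embeds) / (M - 1)   # (N, M, D)

    # Cosine similarity of every utterance to every speaker centroid.
    sim = torch.einsum('nmd,kd->nmk', embeds, centroids)          # (N, M, N)
    own = F.cosine_similarity(embeds, excl, dim=-1)               # (N, M)
    sim[torch.arange(N), :, torch.arange(N)] = own
    sim = w * sim + b

    # Each utterance should score highest against its own speaker's centroid.
    labels = torch.arange(N).repeat_interleave(M)
    return F.cross_entropy(sim.reshape(N * M, N), labels)
```

In the original formulation, `w` and `b` are trained jointly with the encoder, with `w` constrained to stay positive.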
YourTTS ([Casanova2021](https://arxiv.org/abs/2112.02418)) is one of the latest systems for zero-shot multispeaker TTS. It is based on the VITS model and achieves state-of-the-art zero-shot results; in addition, it can be fine-tuned with less than one minute of speech from a new speaker to reach state-of-the-art voice similarity with reasonable quality. A usage example follows the figure below.
![[yourtts-train-inference.png]]
YourTTS training and inference procedure ([Casanova2021](https://arxiv.org/abs/2112.02418)).
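As a concrete example, the Coqui TTS package distributes a YourTTS checkpoint; a zero-shot cloning call might look like the sketch below. The reference-clip and output paths are placeholders, and the model name and API should be checked against the installed `TTS` version.

```python
# pip install TTS
from TTS.api import TTS

# Download and load the released YourTTS checkpoint.
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot cloning: condition on a short reference clip of the target speaker.
tts.tts_to_file(
    text="This voice was cloned from a few seconds of reference audio.",
    speaker_wav="target_speaker.wav",  # placeholder path to the reference clip
    language="en",
    file_path="cloned.wav",
)
```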