VALL-E was trained on 60,000 hours of audio from the LibriLight dataset, covering roughly 7,000 distinct speakers, over 100 times more data than previous state-of-the-art TTS systems. Since LibriLight is audio-only, an automatic speech recognition model was used to generate the transcripts. A pre-trained EnCodec model served as the audio tokenizer, converting waveforms into discrete codec codes. During training, random crops of 10 to 20 seconds were sampled from each LibriLight utterance, and a further 3 seconds from the same utterance served as the acoustic prompt (a rough sketch of this sampling scheme appears below). The model was trained on 16 NVIDIA Tesla V100 GPUs, a relatively modest setup compared to large state-of-the-art language models.

## References

- [\[2301.02111\] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/abs/2301.02111)
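To make the sampling scheme concrete, here is a minimal sketch in Python. It assumes the utterance has already been tokenized by EnCodec into codes shaped `(n_quantizers, n_frames)` and assumes a 75 Hz codec frame rate; the function name, constants, and structure are illustrative, not taken from the paper's actual data pipeline.

```python
import numpy as np

# Assumed frame rate of the EnCodec tokenizer (75 codec frames per second
# for the 24 kHz model); an assumption for this sketch, not a VALL-E spec.
FRAMES_PER_SECOND = 75

rng = np.random.default_rng(0)

def sample_training_pair(codes: np.ndarray):
    """Crop a random 10-20 s training segment from one utterance's codec
    codes and pick a 3 s acoustic prompt from the same utterance.

    `codes` has shape (n_quantizers, n_frames). This is an illustrative
    sketch of the sampling described in the paper, not the authors' code.
    """
    n_frames = codes.shape[1]

    # Random crop length between 10 and 20 seconds, in codec frames.
    crop_len = rng.integers(10 * FRAMES_PER_SECOND, 20 * FRAMES_PER_SECOND + 1)
    crop_len = min(crop_len, n_frames)
    crop_start = rng.integers(0, n_frames - crop_len + 1)
    segment = codes[:, crop_start : crop_start + crop_len]

    # 3-second acoustic prompt taken from the same utterance.
    prompt_len = min(3 * FRAMES_PER_SECOND, n_frames)
    prompt_start = rng.integers(0, n_frames - prompt_len + 1)
    prompt = codes[:, prompt_start : prompt_start + prompt_len]

    return prompt, segment
```

Sampling the prompt from the same utterance gives the model a guaranteed speaker match between prompt and target, which is what lets VALL-E learn in-context speaker imitation without explicit speaker labels.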