Speech Synthesis - learnius

Converting a formulated message into a speech waveform is a central task in speech and language processing. It involves taking, for example, a string of letters or words and transforming it into audible output that human listeners can understand. In this module, we look at the techniques and methods used to do this effectively. We cover the main components of speech synthesis, including text analysis, prosody modeling, and signal processing, as well as the challenges of speech synthesis and how they can be addressed with advanced algorithms and machine-learning techniques.

## Fundamental Concepts

### Introduction

- [[what is speech synthesis]]
- [[text-to-speech synthesis (TTS)]]
- [[concept-to-speech (CTS)]]
- [[brain-to-speech]]
- [[augmentative and alternative communication (AAC)]]

### Technologies

- [[speech synthesis technologies]]
- [[articulatory speech synthesis]]
- [[formant speech synthesis]]
- [[concatenative synthesis]]
- [[statistical parametric synthesis (SPSS)]]
- [[neural speech synthesis]]

### Evaluation

- [[speech synthesis evaluation]]
- [[subjective test]]
- [[mean opinion score (MOS)]]
- [[AB test]]
- [[objective test]]
- [[perceptual evaluation of speech quality (PESQ)]]
- [[mel cepstral distortion (MCD)]]
- [[word error rate (WER)]]

### Probabilistic Formulation

- [[probabilistic formulation of TTS]]
- [[acoustic model]]
- [[acoustic features]]
- [[linguistic features]]
- [[traditional TTS pipeline approach]]

### TTS Front End

- [[text analysis for TTS]]
- [[text normalization]]
- [[word segmentation]]
- [[part-of-speech tagging]]
- [[prosody prediction]]
- [[grapheme-to-phoneme (G2P) conversion]]

### Acoustic Model

- [[intermediate spectrogram]]
- [[acoustic models in statistical parametric speech synthesis]]
- [[text-to-spectrogram models]]
- [[RNN-based models]]
- [[Tacotron architecture]]
- [[transformer-based models]]
- [[FastSpeech architecture|FastSpeech]]
- [[attention-based vs duration-based models]]

### Waveform Generation

- [[vocoder]]
- [[phase reconstruction]]
- [[Griffin-Lim algorithm]]
- [[WaveNet vocoder]]

### Speaker and Style Embeddings

- [[speaker's voice]]
- [[speaking style]]
- [[personalized speech synthesis]]
- [[voice cloning]]
- [[cross-lingual voice cloning]]
- [[latent space]]
- [[learning speaker embeddings]]
- [[learning style embeddings]]
- [[global style tokens (GST)]]

## End-to-End Models

- [[end-to-end speech synthesis]]
- [[zero-shot voice cloning]]

## Advanced Topics

### Acoustic Model

- [[VAE Tacotron2 architecture]]
- [[transformer-based models]]
- [[FastSpeech architecture|FastSpeech]]
- [[FastSpeech2 architecture|FastSpeech2]]

### Waveform Generation

- [[GAN-based vocoders]]
- [[HiFiGAN vocoder|HiFiGAN]]

### End-to-End Models

- [[normalizing flows]]
- [[flow-based models]]
- [[GlowTTS architecture]]
- [[VITS architecture]]
- [[YourTTS architecture]]

### Neural Codec Language Modeling

- [[Encodec model]]
- [[VALL-E architecture]]

## Readings

### [[Backstrom 2022]], Chapter 9: [Speech Synthesis](https://speechprocessingbook.aalto.fi/Speech_Synthesis.html)

- [9.1. Concatenative speech synthesis](https://speechprocessingbook.aalto.fi/Synthesis/Concatenative_speech_synthesis.html)
- [9.2. Statistical parametric speech synthesis](https://speechprocessingbook.aalto.fi/Synthesis/Statistical_parametric_speech_synthesis.html)

### [[Backstrom 2022]], Chapter 3: [Basic Representations](https://speechprocessingbook.aalto.fi/Representations/Representations.html)

- [3.13. Pitch-Synchronous Overlap-Add (PSOLA)](https://speechprocessingbook.aalto.fi/Representations/Pitch-Synchoronous_Overlap-Add_PSOLA.html?highlight=psola)

### [[Backstrom 2022]], Chapter 5: Modelling tools in speech processing

- [5.8. Vocoder](https://speechprocessingbook.aalto.fi/Modelling/Vocoder.html)
- [5.9. The Griffin-Lim algorithm: Signal estimation from modified short-time Fourier transform](https://speechprocessingbook.aalto.fi/Modelling/griffinlim.html)

### [[Backstrom 2022]], Chapter 6: Evaluation of speech processing methods

- [6.1. Subjective quality evaluation](https://speechprocessingbook.aalto.fi/Evaluation/Subjective_quality_evaluation.html)
- [6.2. Objective quality evaluation](https://speechprocessingbook.aalto.fi/Evaluation/Objective_quality_evaluation.html)