Statistical parametric speech synthesis (SPSS) was proposed to address the drawbacks of concatenative TTS. The basic idea is to first generate acoustic parameters and then recover speech from those parameters algorithmically. An SPSS system typically consists of three components: a text analysis module, a parameter prediction module (the acoustic model), and a vocoder analysis/synthesis module. The text analysis module processes the input text and extracts linguistic features such as phonemes, duration, and POS tags. The acoustic model is trained on paired linguistic and acoustic features, the latter extracted from speech through vocoder analysis. At synthesis time, the vocoder reconstructs speech from the predicted acoustic features.

SPSS has several advantages over earlier TTS systems, including naturalness, flexibility, and low data cost; its drawbacks include lower intelligibility and a robotic voice quality.

During the 2010s, significant advances in neural networks and deep learning led to their introduction into SPSS, notably deep neural network (DNN)-based and recurrent neural network (RNN)-based models. However, these models merely substituted neural networks for the traditional hidden Markov models (HMMs) while still predicting acoustic features from linguistic features, thereby adhering to the SPSS paradigm. Wang et al. (2016) proposed a groundbreaking approach that generates acoustic features directly from phoneme sequences; this pioneering work can be considered the first exploration of end-to-end speech synthesis.
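The frame-wise mapping at the heart of DNN-based SPSS can be sketched as follows. This is a minimal illustrative toy, not any published system: the feature dimensions, the phoneme-to-feature encoding, and the untrained random-weight network are all hypothetical stand-ins for a real text analysis front-end and a trained acoustic model.

```python
import numpy as np

# Hypothetical feature dimensions (illustrative only).
LINGUISTIC_DIM = 300  # e.g. phoneme identity, POS, positional features
ACOUSTIC_DIM = 64     # e.g. mel-cepstral coefficients, log-F0, aperiodicity

rng = np.random.default_rng(0)

def text_analysis(phonemes, n_frames_per_phone=5):
    """Toy text-analysis stage: map each phoneme to frame-level
    linguistic feature vectors (hashed one-hot identity + positions)."""
    frames = []
    for i, ph in enumerate(phonemes):
        vec = np.zeros(LINGUISTIC_DIM)
        vec[hash(ph) % (LINGUISTIC_DIM - 2)] = 1.0    # phoneme identity
        vec[-2] = i / max(len(phonemes) - 1, 1)       # position in utterance
        for t in range(n_frames_per_phone):
            frame = vec.copy()
            frame[-1] = t / (n_frames_per_phone - 1)  # position within phone
            frames.append(frame)
    return np.stack(frames)

class DNNAcousticModel:
    """Minimal feed-forward acoustic model: frame-wise regression from
    linguistic features to acoustic features (random, untrained weights)."""
    def __init__(self, hidden=128):
        self.w1 = rng.normal(0.0, 0.1, (LINGUISTIC_DIM, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, ACOUSTIC_DIM))

    def predict(self, lin_feats):
        h = np.tanh(lin_feats @ self.w1)
        return h @ self.w2  # a vocoder would synthesize speech from this

lin = text_analysis(["sil", "h", "e", "l", "ou", "sil"])
acoustic = DNNAcousticModel().predict(lin)
print(lin.shape, acoustic.shape)  # (30, 300) (30, 64)
```

In a real system, the acoustic model would be trained with a regression loss on vocoder-extracted features, and a vocoder such as WORLD or STRAIGHT would turn the predicted frame sequence back into a waveform.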
An SPSS system:
![[rasanen2020-spss-system.png]]

SPSS feature generation:
![[rasanen2020-spss-feature-gen.png]]

SPSS feature extraction and waveform generation:
![[rasanen2020-spss-extract-gen.png]]

SPSS training:
![[rasanen2020-spss-training.png]]

[Rasanen (2020)](https://wiki.aalto.fi/display/ITSP/Statistical+parametric+speech+synthesis)

## References

Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Sixth European Conference on Speech Communication and Technology, 1999.

Heiga Zen. Generative model-based text-to-speech synthesis. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45882.pdf, 2017.

Wenfu Wang, Shuang Xu, and Bo Xu. First step towards end-to-end parametric TTS synthesis: Generating spectral parameters with neural attention. In Interspeech, pages 2243–2247, 2016.