The vocoder was originally a system designed to reduce the bandwidth needed to transmit intelligible voice. The name is a contraction of "voice coder" (also given as "voice encoder"): the system decomposes the speech signal into separate components for the source and for the energy in the different frequency bands.

![[dudley-vocoder-1940.png]]
Homer Dudley (1940).

In a speech synthesis architecture, the vocoder is the component responsible for generating speech waveforms from the [[acoustic features]].

One common approach used in vocoders is the [[source-filter model]]. It separates the speech signal into two components: the excitation signal (source) and the vocal tract filter. The excitation signal represents the periodic and aperiodic components of the speech waveform, while the vocal tract filter captures the resonances and spectral characteristics of the speech.

Several types of vocoders are used in speech synthesis, including waveform concatenation-based vocoders, harmonic plus noise vocoders, and statistical parametric vocoders. Each type has its own strengths and limitations, so the choice depends on the requirements of the system and the desired characteristics of the synthesized speech.

In the [[probabilistic formulation of TTS]], the vocoder is responsible for the final stage: generating the synthesized waveform $\mathbf{x}$ from the predicted [[acoustic features]] $\hat{\mathbf{o}}$ by drawing from the conditional distribution

$$
\mathbf{x} \sim p(\mathbf{x} \mid \hat{\mathbf{o}})
$$

One of the main challenges of the vocoding process for [[statistical parametric synthesis (SPSS)]] and [[neural speech synthesis]] is the need to perform [[phase reconstruction]] to produce clear audio, because the acoustic features typically do not carry phase information.

## References

[A Brief History of the Vocoder](https://www.izotope.com/en/learn/a-brief-history-of-the-vocoder.html)
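
## Example: source-filter sketch

The source-filter decomposition described above can be illustrated with a minimal sketch in plain Python. It is a toy under stated assumptions, not how any particular vocoder is implemented: the excitation is either an impulse train (voiced) or white noise (unvoiced), and the vocal tract filter is a cascade of two-pole resonators. The function names, the sample rate, and the formant frequencies and bandwidths are all illustrative choices that do not come from the text above.

```python
import math
import random


def excitation(n_samples, sample_rate, f0=None, seed=0):
    """Source: impulse train at f0 Hz if voiced, white noise if f0 is None."""
    if f0 is None:
        rng = random.Random(seed)
        return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    period = round(sample_rate / f0)  # pitch period in samples
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator_coeffs(freq, bandwidth, sample_rate):
    """Feedback coefficients (a1, a2) of a two-pole resonator (one formant)."""
    r = math.exp(-math.pi * bandwidth / sample_rate)  # pole radius, < 1 so stable
    theta = 2.0 * math.pi * freq / sample_rate        # pole angle
    return -2.0 * r * math.cos(theta), r * r


def vocal_tract_filter(source, formants, sample_rate):
    """Filter: cascade of resonators, y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    signal = list(source)
    for freq, bandwidth in formants:
        a1, a2 = resonator_coeffs(freq, bandwidth, sample_rate)
        y1 = y2 = 0.0  # previous two output samples
        out = []
        for x in signal:
            y = x - a1 * y1 - a2 * y2
            out.append(y)
            y2, y1 = y1, y
        signal = out
    return signal


# Assumed values for illustration only: (centre frequency Hz, bandwidth Hz).
SAMPLE_RATE = 16000
FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

# Same filter, two different sources: periodic (voiced) and aperiodic (unvoiced).
voiced = vocal_tract_filter(
    excitation(1600, SAMPLE_RATE, f0=100.0), FORMANTS, SAMPLE_RATE
)
unvoiced = vocal_tract_filter(
    excitation(1600, SAMPLE_RATE), FORMANTS, SAMPLE_RATE
)
```

Keeping the source and the filter as separate functions mirrors the model itself: pitch and voicing are changed only through `excitation`, while the spectral envelope is changed only through the formant list. Practical vocoders generally estimate both parts from data and mix periodic and aperiodic excitation rather than switching between them.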