# 4. Speech Signal Representations

This module covers the different forms in which a speech signal can be represented. Depending on the application and the level of detail required, common representations include:

1. Time-domain representations: These show how a feature of the speech signal changes over time. The feature can be the samples of the sound pressure themselves or a metric computed from them over a short time interval. They are usually plotted as a waveform, that is, amplitude versus time.
2. Frequency-domain representations: These show the distribution of frequency components in the speech signal. They are usually plotted as a spectrum, that is, amplitude versus frequency.
3. Spectrogram representation: This shows how the frequency content of the speech signal changes over time. It is usually drawn as a two-dimensional plot with time on the x-axis, frequency on the y-axis, and the amplitude of each frequency component shown as a color or shade.
4. Cepstral representation: This separates the slowly varying spectral envelope of the speech signal from its fine harmonic structure. It is usually plotted as a sequence of cepstral coefficients, which are obtained by taking the inverse Fourier transform (or, for MFCCs, the discrete cosine transform) of the logarithm of the magnitude spectrum.

Some of the concepts presented in this module should be familiar to students who have taken a Signals and Systems or Digital Signal Processing course.
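
The four representations listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than course code: the function names, the 25 ms frame length, the 10 ms hop, the Hamming window, and the synthetic three-harmonic test signal are all assumptions chosen for the example.

```python
import numpy as np


def frame_signal(x, frame_len, hop_len):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    starts = hop_len * np.arange(n_frames)[:, None]
    return x[starts + np.arange(frame_len)[None, :]]


def short_time_rms(frames):
    """Time domain: root mean square of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


def zero_crossing_rate(frames):
    """Time domain: fraction of adjacent sample pairs whose sign differs."""
    negative = np.signbit(frames)
    return np.mean(negative[:, 1:] != negative[:, :-1], axis=1)


def magnitude_spectrogram(frames):
    """Time-frequency: magnitude of the windowed DFT of every frame."""
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))


def real_cepstrum(frame, eps=1e-10):
    """Cepstral: inverse DFT of the log-magnitude spectrum of one frame."""
    spectrum = np.fft.fft(frame * np.hamming(len(frame)))
    return np.fft.ifft(np.log(np.abs(spectrum) + eps)).real


# Hypothetical test signal: three harmonics of 200 Hz sampled at 16 kHz.
fs = 16000
f0 = 200.0
t = np.arange(fs) / fs  # one second of samples
x = sum(a * np.sin(2 * np.pi * k * f0 * t)
        for k, a in [(1, 1.0), (2, 0.5), (3, 0.25)])

frame_len, hop_len = 400, 160          # 25 ms frames, 10 ms hop (assumed values)
frames = frame_signal(x, frame_len, hop_len)

rms = short_time_rms(frames)           # one value per frame
zcr = zero_crossing_rate(frames)       # one value per frame
spec = magnitude_spectrogram(frames)   # rows: time (frames), columns: frequency
spectrum = spec[0]                     # frequency domain: spectrum of one frame
cep = real_cepstrum(frames[0])         # cepstral coefficients of one frame

# Frequency (in Hz) of the strongest component in the first frame.
freqs = np.fft.rfftfreq(frame_len, d=1 / fs)
peak_hz = freqs[np.argmax(spectrum)]
```

With 400-sample frames at 16 kHz the DFT bins are 40 Hz apart, so the 200 Hz fundamental of this test signal falls exactly on a bin and `peak_hz` picks it out. Real speech is not stationary like this tone, which is why the per-frame quantities (`rms`, `zcr`, and the rows of `spec`) are computed on short frames instead of on the whole signal.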

## Time-Domain

### Signals

- [[continuous-time signal]]
- [[discrete-time signal]]
- [[sampling]]
- [[amplitude quantization]]

### Periodicity

- [[periodic signal]]
- [[discrete-time complex exponential signal]]
- [[fundamental frequency]]

### Short-Time Processing

- [[windowing]]
- [[frame]]
- [[window length]]
- [[hop length]]

### Time-Domain Features

- [[root mean square (RMS)]]
- [[zero-crossing rate]]
- [[fundamental frequency]]
- [[autocorrelation]]

## Frequency-Domain

- [[discrete Fourier series (DFS)]]
- [[discrete-time Fourier transform (DTFT)]]
- [[discrete Fourier transform (DFT)]]
- [[spectrum of a signal]]
- [[short-time Fourier transform (STFT)]]
- [[spectral leakage]]
- [[spectral features of window functions]]

## Spectrogram

- [[spectrogram]]
- [[narrowband spectrogram]]
- [[wideband spectrogram]]

## Mel-Frequency Cepstrum

- [[cepstral analysis]]
- [[cepstrum]]
- [[mel-frequency spectrum]]
- [[mel spectrogram]]
- [[mel-frequency cepstrum]]
- [[discrete cosine transform (DCT)]]
- [[mel-frequency cepstral coefficient (MFCC)]]

## Problems

- [[sign-m01 periodicity of cos]]
- [[sign-m02 discrete-time sinusoid]]
- [[sign-t01 periodicity of sin]]
- [[sign-o01 find the period]]
- [[wind-t01 windows main lobe]]
- [[wind-t02 windows side lobe]]
- [[wind-t03 Hamming window]]
- [[wind-t04 Hanning window]]

## Readings

### [[Backstrom 2022]], Chapter 3: Basic Representations

- [3.3. Waveform](https://speechprocessingbook.aalto.fi/Representations/Waveform.html)
- [3.7. Autocorrelation and autocovariance](https://speechprocessingbook.aalto.fi/Representations/Autocorrelation_and_autocovariance.html)
- [3.8. The cepstrum, mel-cepstrum and mel-frequency cepstral coefficients (MFCCs)](https://speechprocessingbook.aalto.fi/Representations/Melcepstrum.html)
- [3.10. Fundamental frequency (F0)](https://speechprocessingbook.aalto.fi/Representations/Fundamental_frequency_F0.html)
- [3.11. Zero-crossing rate](https://speechprocessingbook.aalto.fi/Representations/Zero-crossing_rate.html)
- [3.12. Deltas and Delta-deltas](https://speechprocessingbook.aalto.fi/Representations/Deltas_and_Delta-deltas.html)

### [[Jurafsky 2022]], Chapter 16: Automatic Speech Recognition and Text-to-Speech, [pdf](https://web.stanford.edu/~jurafsky/slp3/16.pdf)

- 16.1 The Automatic Speech Recognition Task
- 16.2 Feature Extraction for ASR: Log Mel Spectrum

## Optional Readings

### [[Huang 2001]], Chapter 5: Digital Signal Processing, [pdf](https://fenix.tecnico.ulisboa.pt/downloadFile/1970943312400268/SLP-chap5.pdf)

- 5.1 Digital Signals and Systems
- 5.2 Continuous-Frequency Transforms

### [[Huang 2001]], Chapter 6: Speech Signal Representations, [pdf](https://fenix.tecnico.ulisboa.pt/downloadFile/1970943312400325/SLP-chap6.pdf)

- 6.1 Short-Time Fourier Analysis
- 6.2 Acoustical Model of Speech Production
- 6.3 Linear Predictive Coding
- 6.4 Cepstral Processing
- 6.5 Perceptually-Motivated Representations
- 6.6 Formant Frequencies
- 6.7 The Role of Pitch

## Additional Materials

- Interpreting Speech Spectrograms [YouTube](https://youtu.be/yrPk19sHRMg)