The **Mel Cepstral Distortion (MCD)** is an [[objective test]] metric used to quantify the spectral similarity between synthesized and reference speech. It measures the average Euclidean distance, frame by frame, between the Mel-frequency cepstral coefficients (MFCCs) of the two signals, which indicates how much spectral distortion the synthesis introduces.

The MCD is often used in speech synthesis evaluation, where the goal is to produce synthesized speech that sounds as close as possible to natural human speech. A lower MCD indicates a closer match between the spectral characteristics of the synthesized and reference speech, and therefore better synthesis quality.

To calculate the MCD, MFCCs are first extracted from both the synthesized and the reference speech signals. Then, for each pair of corresponding frames, the Euclidean distance between the two MFCC vectors is computed. These distances are averaged over all frames that do not correspond to silence and scaled by a constant $\alpha$, which gives the mean MCD in dB for the entire signal:

$$
\text{MCD}_{\text{dB}} = \frac{\alpha}{N}\sum_{t=0}^{N-1} \sqrt{ \sum_{k=1}^{P} \left(\text{MC}_{\text{syn}}(t,k) - \text{MC}_{\text{ref}}(t,k)\right)^{2}}
$$

where $P$ is the number of cepstral coefficients and $N$ is the number of frames. The inner sum starts at $k=1$, so the zeroth coefficient is excluded. The formula also assumes that frame $t$ of the synthesized signal corresponds to frame $t$ of the reference, so the two signals must be time-aligned before the distances are computed. The parameter $\alpha$ is defined as:

$$
\alpha = \frac{10\sqrt{2}}{\ln 10} \approx 6.14185
$$

While the MCD is a useful metric for evaluating speech synthesis quality, it only reflects spectral similarity. It does not capture other aspects of speech such as prosody, intonation, or pronunciation accuracy.

## Reference

Kominek, J., Schultz, T., Black, A. W. (2008). Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion. Proc. Speech Technology for Under-Resourced Languages (SLTU-2008), 63-68. [PDF](https://www.cs.cmu.edu/~awb/papers/sltu2008/kominek_black.sltu_2008.pdf)
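
## Example

The MCD formula given earlier in this note can be illustrated with a minimal Python sketch. It is not taken from the reference above: the function name `mel_cepstral_distortion` and the input layout are assumptions made for illustration. The sketch assumes that the cepstral coefficients have already been extracted, that the two sequences are time-aligned and free of silence frames, and that each frame holds only coefficients $1$ to $P$ (the zeroth coefficient has been dropped).

```python
import math

# Scaling constant from the formula: 10 * sqrt(2) / ln(10), about 6.14185.
ALPHA = 10.0 * math.sqrt(2.0) / math.log(10.0)


def mel_cepstral_distortion(mc_syn, mc_ref):
    """Mean MCD in dB between two aligned sequences of cepstral frames.

    Hypothetical helper for illustration. Both arguments are sequences of
    frames, and each frame is a sequence of P coefficients (k = 1..P).
    """
    if len(mc_syn) != len(mc_ref) or len(mc_syn) == 0:
        raise ValueError("need the same non-zero number of frames")

    total = 0.0
    for frame_syn, frame_ref in zip(mc_syn, mc_ref):
        if len(frame_syn) != len(frame_ref):
            raise ValueError("frames must have the same number of coefficients")
        # Euclidean distance between the two coefficient vectors of this frame.
        total += math.sqrt(
            sum((s - r) ** 2 for s, r in zip(frame_syn, frame_ref))
        )

    # Average over the N frames and scale to dB.
    return ALPHA * total / len(mc_syn)


# Toy data: four frames of three coefficients, differing by 1.0 in one coefficient.
ref = [[0.0, 0.0, 0.0] for _ in range(4)]
syn = [[1.0, 0.0, 0.0] for _ in range(4)]
print(mel_cepstral_distortion(syn, ref))
```

In this toy case every frame has a Euclidean distance of exactly 1, so the result equals $\alpha$, roughly 6.14 dB. With real speech, the extraction, alignment, and silence-removal steps would have to run before this function is called.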