word error rate (WER)

The **Word Error Rate (WER)** is commonly used in speech recognition tasks but can also be applied to evaluate the accuracy of synthesized speech in text-to-speech systems. It measures the percentage of words in the synthesized output that are recognized incorrectly when compared with the reference text. WER is calculated by dividing the total number of word errors by the total number of words in the reference text and multiplying the result by 100 to obtain a percentage:

$$ \text{WER} = \frac{S + D + I}{N} \times 100\% $$

where:

- $S$ is the number of word substitutions
- $D$ is the number of word deletions
- $I$ is the number of word insertions
- $N$ is the number of words in the reference text

For example, if the reference text contains 100 words and the recognized output differs from it by 10 word errors (substitutions, deletions, and insertions combined), the WER is:

$$ \text{WER} = \frac{10}{100} \times 100\% = 10\% $$

Because insertions are counted as errors, the WER can exceed 100% when the output contains many extra words. A lower WER indicates higher intelligibility of the synthesized speech.
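
The WER formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `word_error_rate` is hypothetical, words are assumed to be separated by whitespace, and no text normalization (lowercasing, punctuation removal) is applied, although real evaluations usually normalize both texts first. The combined error count $S + D + I$ is taken as the word-level edit (Levenshtein) distance between the reference and the recognized output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent, assuming whitespace-separated words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the minimum number of edits (S + D + I) between the
    # reference words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    # prev[-1] now holds S + D + I for the full texts; N is len(ref).
    return prev[-1] / len(ref) * 100.0


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6 words.
    print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.1f}%")
```

In the usage example, the six-word reference is matched with two errors, so the script prints `33.3%`. In a text-to-speech evaluation, the hypothesis would typically be a transcript of the synthesized audio and the reference would be the input text.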