Concatenative synthesis generates speech by piecing together recorded speech units drawn from a database. Units can be as small as a diphone or syllable, or as large as a whole sentence. Many systems use the [[diphone]] as the smallest unit because it captures the change in the acoustic signal during the transition between two adjacent phones.
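To make the diphone idea concrete, here is a minimal sketch that turns a phone sequence into the diphone units a synthesizer would need to look up. The function name `to_diphones`, the `sil` silence label, and the `a-b` naming scheme are assumptions made for illustration; real systems use their own phone sets and unit labels.

```python
def to_diphones(phones, silence="sil"):
    """Return the diphone labels covering a phone sequence.

    Each diphone spans the transition from one phone to the next, so an
    utterance of n phones (padded with silence on both sides) needs
    n + 1 diphones.
    """
    padded = [silence, *phones, silence]
    # Pair every phone with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phone labels for the word "cat".
print(to_diphones(["k", "ae", "t"]))
```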
Because the output is built from real recordings, this technique can produce highly intelligible, authentic-sounding speech, but it requires an extensive recording database to cover all the combinations of speech units that occur in spoken words. The resulting voice may still lack naturalness and emotional expressiveness, because units recorded in different contexts can be inconsistent in stress, prosody, and emotion once they are concatenated. Some of these characteristics can be modified with techniques such as [[pitch-synchronous overlap-add (TD-PSOLA)]].
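The database requirement can be estimated with simple arithmetic: if a unit spans `k` phones and the language has `n` phones, there are at most `n ** k` distinct units. The sketch below computes that upper bound. The 40-phone inventory is a round number assumed for illustration, and the bound overestimates the real need because not every phone combination occurs in a given language.

```python
def max_unit_inventory(num_phones: int, phones_per_unit: int) -> int:
    """Upper bound on the number of distinct units of a given width."""
    return num_phones ** phones_per_unit


NUM_PHONES = 40  # assumed inventory size, for illustration only

print("diphones: ", max_unit_inventory(NUM_PHONES, 2))
print("triphones:", max_unit_inventory(NUM_PHONES, 3))
```

The growth from two-phone to three-phone units illustrates why larger units need a much larger recording database.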
![[hunt-unit-sel-costs.png]]
[Hunt 1996](http://www.era.lib.ed.ac.uk/bitstream/1842/1082/1/hunt+1996.pdf)
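
Unit selection in the style of Hunt and Black chooses, for each target position, one candidate unit from the database so that the sum of target costs (how well a unit matches the specification) and concatenation costs (how well adjacent units join) is minimal, typically with a Viterbi search. The following is a minimal sketch of that search, not the authors' implementation: the function `select_units`, the `(name, pitch)` unit representation, and both cost functions are hypothetical, and every target is assumed to have at least one candidate.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # target specification for one position
U = TypeVar("U")  # candidate unit from the database


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
) -> tuple[float, list[U]]:
    """Viterbi search for the cheapest sequence of units."""
    if not targets:
        return 0.0, []
    # history[i][j] = (cost of the best path ending in candidates[i][j],
    #                  index of the previous unit on that path)
    history = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        prev_units, prev_row = candidates[i - 1], history[-1]
        row = []
        for unit in candidates[i]:
            # Cheapest way to reach `unit` from any previous candidate.
            k = min(
                range(len(prev_units)),
                key=lambda p: prev_row[p][0] + join_cost(prev_units[p], unit),
            )
            cost = (
                prev_row[k][0]
                + join_cost(prev_units[k], unit)
                + target_cost(targets[i], unit)
            )
            row.append((cost, k))
        history.append(row)
    # Backtrack from the cheapest final unit.
    j = min(range(len(history[-1])), key=lambda p: history[-1][p][0])
    total = history[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = history[i][j][1]
    path.reverse()
    return total, path


# Toy example: targets are desired pitches in Hz, units are (name, pitch).
targets = [100, 110]
candidates = [[("a1", 100), ("a2", 140)], [("b1", 150), ("b2", 112)]]
pitch_target_cost = lambda t, u: abs(t - u[1])
pitch_join_cost = lambda a, b: 0.5 * abs(a[1] - b[1])

total, path = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
print(total, [name for name, _ in path])
```

In this toy example the search picks the units whose pitches are close both to the targets and to each other, which is the trade-off the two costs are meant to express.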
![[heiga-zen-concat-synth.png]]
[Heiga Zen 2017](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45882.pdf)
## References
Joseph Olive. Rule synthesis of speech from dyadic units. In ICASSP’77. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 568–570. IEEE, 1977.
Yoshinori Sagisaka, Nobuyoshi Kaiki, Naoto Iwahashi, and Katsuhiko Mimura. ATR μ-talk speech synthesis system. In Second International Conference on Spoken Language Processing, 1992.
Andrew J Hunt and Alan W Black. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, pages 373–376. IEEE, 1996.
Alan Black, Paul Taylor, Richard Caley, and Rob Clark. The Festival speech synthesis system, 1998.