Prosody in speech, which comprises rhythm, stress, and intonation, is conveyed through variations in syllable duration, loudness, and pitch, and it is critical for effective human communication. Predicting prosody relies on tagging systems that label the different types of prosodic events, and different languages have their own systems and tools for this purpose. In English, a widely used tagging system is [[ToBI]] (Tones and Break Indices), which defines tags for tones (pitch accents, phrase accents, and boundary tones) and for breaks (the perceived strength of the boundary between adjacent words). Many studies have investigated models and features for predicting ToBI-based prosody tags. In Chinese speech synthesis, typical prosody boundary labels are the prosodic word (PW), prosodic phrase (PPH), and intonational phrase (IPH), which together form a three-layer hierarchical prosody tree.
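The three-layer hierarchy of PW, PPH, and IPH labels can be illustrated with a minimal Python sketch that groups a flat, boundary-labeled token sequence into a nested tree. The function name `build_prosody_tree`, the `(token, boundary)` input format, and the placeholder tokens are assumptions made for illustration and are not taken from any particular system. The sketch also assumes that a stronger boundary closes all weaker units, so an IPH boundary ends the current PPH and PW as well.

```python
# Boundary strength ranks (assumed): a stronger boundary also closes weaker units.
RANK = {"PW": 1, "PPH": 2, "IPH": 3}


def build_prosody_tree(tokens):
    """Group (token, boundary) pairs into a nested IPH > PPH > PW > token tree.

    `boundary` labels the boundary that follows the token and is one of
    None (no boundary), "PW", "PPH", or "IPH".
    Returns a list of IPHs; each IPH is a list of PPHs; each PPH is a list
    of PWs; each PW is a list of tokens.
    """
    iphs, pphs, pws, pw_tokens = [], [], [], []
    for text, boundary in tokens:
        pw_tokens.append(text)
        rank = RANK.get(boundary, 0)
        if rank >= 1:  # close the current prosodic word
            pws.append(pw_tokens)
            pw_tokens = []
        if rank >= 2:  # close the current prosodic phrase
            pphs.append(pws)
            pws = []
        if rank >= 3:  # close the current intonational phrase
            iphs.append(pphs)
            pphs = []
    # Flush any units left open at the end of the sequence.
    if pw_tokens:
        pws.append(pw_tokens)
    if pws:
        pphs.append(pws)
    if pphs:
        iphs.append(pphs)
    return iphs


# Placeholder tokens, not real annotated speech data.
example = [
    ("t1", None), ("t2", "PW"), ("t3", "PPH"),
    ("t4", "PW"), ("t5", "IPH"),
    ("t6", "IPH"),
]
tree = build_prosody_tree(example)
print(len(tree))  # 2 intonational phrases
```

Nested lists keep the sketch short; a real pipeline would more likely use explicit node types that carry the label and timing information for each unit.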