The intermediate spectrogram is the output of the [[acoustic model]].
This representation relies on the fact that [[phone]]s and [[prosody]] can be modeled from the magnitude of the spectrum alone, without phase information. The phase is then estimated by the [[vocoder]].
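As a rough illustration of why magnitude can be separated from phase, the sketch below splits the spectrum of one frame into its two parts and compares it with a delayed copy of the same frame. The test tone, the frame size, and the delay are assumptions chosen for the example, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical test frame: a 440 Hz tone sampled at 16 kHz (illustrative values).
sample_rate = 16000
n = 512
t = np.arange(n) / sample_rate
frame = np.sin(2 * np.pi * 440 * t)

# Complex spectrum of the frame, split into magnitude and phase.
spectrum = np.fft.rfft(frame)
magnitude = np.abs(spectrum)
phase = np.angle(spectrum)

# The same frame, circularly delayed by 100 samples.
shifted_spectrum = np.fft.rfft(np.roll(frame, 100))
shifted_magnitude = np.abs(shifted_spectrum)
shifted_phase = np.angle(shifted_spectrum)

# The delay leaves the magnitude unchanged and only alters the phase.
same_magnitude = np.allclose(magnitude, shifted_magnitude)
same_phase = np.allclose(phase, shifted_phase)

# Magnitude and phase together recover the original frame.
reconstructed = np.fft.irfft(magnitude * np.exp(1j * phase), n=n)
```

Here `same_magnitude` is true while `same_phase` is false: the magnitude keeps which frequencies are present and how strong they are, while the phase mostly encodes where the frame sits in time. Without the phase the waveform cannot be rebuilt directly, which is the part left to the vocoder.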
This frequency-domain representation can incorporate knowledge of the [[auditory transduction]] mechanism, for example by using the [[mel spectrogram]].
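A minimal sketch of how a mel representation can be derived from a magnitude spectrum is given below. It assumes one common mel formula (2595 · log10(1 + f/700)); other conventions exist. The function names, the triangular filter shape, and the settings (40 bands, 512-point FFT, 16 kHz) are illustrative choices, not taken from the text.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (one common convention; others exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge points: each filter spans three consecutive points.
    max_mel = float(hz_to_mel(sample_rate / 2.0))
    edges_hz = mel_to_hz(np.linspace(0.0, max_mel, n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

# Hypothetical settings: 40 mel bands, 512-point FFT, 16 kHz audio.
bank = mel_filterbank(n_mels=40, n_fft=512, sample_rate=16000)
magnitude_frame = np.ones(512 // 2 + 1)  # placeholder magnitude spectrum
mel_frame = bank @ magnitude_frame       # one mel-spectrogram frame, shape (40,)
```

Because the filters are equally spaced in mels, they are narrow at low frequencies and wide at high frequencies, which mirrors the coarser frequency resolution of hearing at higher pitches.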
The [[short-time Fourier transform (STFT)]] operates on chunks of speech whose duration is useful for phoneme and prosody modeling, and it can be computed efficiently with the [[fast Fourier transform (FFT)]].
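The framing idea can be sketched as follows: slice the signal into short overlapping frames, apply a window, and take the FFT of each frame. The function name `stft_magnitude`, the Hann window, and the settings (16 kHz audio, 25 ms frames, 10 ms hop) are assumptions chosen for illustration rather than values given in the text.

```python
import numpy as np

def stft_magnitude(signal, frame_length, hop_length):
    """Magnitude of the short-time Fourier transform.

    Returns an array of shape (n_frames, frame_length // 2 + 1).
    """
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # rfft uses the FFT algorithm and keeps only the non-negative frequencies.
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical settings: 16 kHz audio, 25 ms frames (400 samples), 10 ms hop (160 samples).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate   # one second of audio
signal = np.sin(2 * np.pi * 1000 * t)      # 1 kHz test tone
spec = stft_magnitude(signal, frame_length=400, hop_length=160)

# Each bin covers sample_rate / frame_length = 40 Hz,
# so the 1 kHz tone falls on bin 25 in every frame.
peak_bins = spec.argmax(axis=1)
```

The frame length trades time resolution against frequency resolution: longer frames separate nearby frequencies better but blur fast changes, which is why frames of a few tens of milliseconds are a common choice for speech.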