from text. The system is composed of a recurrent sequence-to-sequence feature prediction
network that maps character embeddings to mel-scale spectrograms, followed by a modified
WaveNet model acting as a vocoder to synthesize time-domain waveforms from those
spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a
MOS of 4.58 for professionally recorded speech. To validate our design choices, we present …