Tacotron: Towards end-to-end speech synthesis

Y Wang, RJ Skerry-Ryan, D Stanton, Y Wu… - arXiv preprint arXiv …, 2017 - arxiv.org
A text-to-speech synthesis system typically consists of multiple stages, such as a text
analysis frontend, an acoustic model and an audio synthesis module. Building these …

Tacotron: A fully end-to-end text-to-speech synthesis model

Y Wang, RJ Skerry-Ryan… - arXiv preprint …, 2017 - bengio.abracadoudou.com
A text-to-speech synthesis system typically consists of multiple stages, such as a
text analysis frontend, an acoustic model and an audio synthesis module. Building these …

NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality

X Tan, J Chen, H Liu, J Cong, C Zhang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. Some questions naturally arise that whether a TTS system can achieve human-level …

Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis

R Valle, K Shih, R Prenger, B Catanzaro - arXiv preprint arXiv:2005.05957, 2020 - arxiv.org
In this paper we propose Flowtron: an autoregressive flow-based generative network for
text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows …

Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language

Y Yasuda, X Wang, S Takaki… - ICASSP 2019-2019 …, 2019 - ieeexplore.ieee.org
End-to-end speech synthesis is a promising approach that directly converts raw text to
speech. Although it was shown that Tacotron2 outperforms classical pipeline systems with …

Semi-supervised training for improving data efficiency in end-to-end speech synthesis

YA Chung, Y Wang, WN Hsu, Y Zhang… - ICASSP 2019-2019 …, 2019 - ieeexplore.ieee.org
Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent
results, they typically require a sizable set of high-quality <text, audio> pairs for training …

StyleTTS: A style-based generative model for natural and diverse text-to-speech synthesis

YA Li, C Han, N Mesgarani - arXiv preprint arXiv:2205.15439, 2022 - arxiv.org
Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech
owing to the rapid development of parallel TTS systems, but producing speech with …

Semi-supervised generative modeling for controllable speech synthesis

R Habib, S Mariooryad, M Shannon… - arXiv preprint arXiv …, 2019 - arxiv.org
We present a novel generative model that combines state-of-the-art neural text-to-speech
(TTS) with semi-supervised probabilistic latent variable models. By providing partial …

End-to-end adversarial text-to-speech

J Donahue, S Dieleman, M Bińkowski, E Elsen… - arXiv preprint arXiv …, 2020 - arxiv.org
Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each
of which is designed or learnt independently from the rest. In this work, we take on the …

DelightfulTTS 2: End-to-end speech synthesis with adversarial vector-quantized auto-encoders

Y Liu, R Xue, L He, X Tan, S Zhao - arXiv preprint arXiv:2207.04646, 2022 - arxiv.org
Current text to speech (TTS) systems usually leverage a cascaded acoustic model and
vocoder pipeline with mel-spectrograms as the intermediate representations, which suffer …