Fastspeech: Fast, robust and controllable text to speech

Y Ren, Y Ruan, X Tan, T Qin, S Zhao… - Advances in neural …, 2019 - proceedings.neurips.cc
Neural network based end-to-end text to speech (TTS) has significantly improved the quality
of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel …

Libritts: A corpus derived from librispeech for text-to-speech

H Zen, V Dang, R Clark, Y Zhang, RJ Weiss… - arXiv preprint arXiv …, 2019 - arxiv.org
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech
use. It is derived from the original audio and text materials of the LibriSpeech corpus, which …

ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit

T Hayashi, R Yamamoto, K Inoue… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-
TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit …

Learning latent representations for style control and transfer in end-to-end speech synthesis

YJ Zhang, S Pan, L He, ZH Ling - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
In this paper, we introduce the Variational Autoencoder (VAE) to an end-to-end speech
synthesis model, to learn the latent representation of speaking styles in an unsupervised …

[PDF][PDF] DurIAN: Duration Informed Attention Network for Speech Synthesis.

C Yu, H Lu, N Hu, M Yu, C Weng, K Xu, P Liu, D Tuo… - Interspeech, 2020 - isca-archive.org
In this paper, we present a robust and effective speech synthesis system that generates
highly natural speech. The key component of the proposed system is Duration Informed …

Flow-TTS: A non-autoregressive network for text to speech based on flow

C Miao, S Liang, M Chen, J Ma… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
In this work, we propose Flow-TTS, a non-autoregressive end-to-end neural TTS model
based on generative flow. Unlike other non-autoregressive models, Flow-TTS can achieve …

The zero resource speech challenge 2019: TTS without T

E Dunbar, R Algayres, J Karadayi, M Bernard… - arXiv preprint arXiv …, 2019 - arxiv.org
We present the Zero Resource Speech Challenge 2019, which proposes to build a speech
synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without …

Speech synthesis with mixed emotions

K Zhou, B Sisman, R Rana… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Emotional speech synthesis aims to synthesize human voices with various emotional effects.
The current studies are mostly focused on imitating an averaged style belonging to a specific …

Almost unsupervised text to speech and automatic speech recognition

Y Ren, X Tan, T Qin, S Zhao… - … on machine learning, 2019 - proceedings.mlr.press
Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech
processing, and both achieve impressive performance thanks to the recent advance in deep …

Durian: Duration informed attention network for multimodal synthesis

C Yu, H Lu, N Hu, M Yu, C Weng, K Xu, P Liu… - arXiv preprint arXiv …, 2019 - arxiv.org
In this paper, we present a generic and robust multimodal synthesis system that produces
highly natural speech and facial expressions simultaneously. The key component of this …