A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

RobuTrans: A robust transformer-based text-to-speech model

N Li, Y Liu, Y Wu, S Liu, S Zhao, M Liu - Proceedings of the AAAI …, 2020 - ojs.aaai.org
Recently, neural network based speech synthesis has achieved outstanding results, by
which the synthesized audios are of excellent quality and naturalness. However, current …

Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

RJ Weiss, RJ Skerry-Ryan, E Battenberg… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
We describe a sequence-to-sequence neural network which directly generates speech
waveforms from text inputs. The architecture extends the Tacotron model by incorporating a …

Semi-supervised generative modeling for controllable speech synthesis

R Habib, S Mariooryad, M Shannon… - arXiv preprint arXiv …, 2019 - arxiv.org
We present a novel generative model that combines state-of-the-art neural text-to-speech
(TTS) with semi-supervised probabilistic latent variable models. By providing partial …

Diff-TTS: A denoising diffusion model for text-to-speech

M Jeong, H Kim, SJ Cheon, BJ Choi, NS Kim - arXiv preprint arXiv …, 2021 - arxiv.org
Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded
in generating human-like speech, there is still room for improvements to its naturalness and …

Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis

R Valle, K Shih, R Prenger, B Catanzaro - arXiv preprint arXiv:2005.05957, 2020 - arxiv.org
In this paper we propose Flowtron: an autoregressive flow-based generative network for
text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows …

Controllable neural text-to-speech synthesis using intuitive prosodic features

T Raitio, R Rasipuram, D Castellani - arXiv preprint arXiv:2009.06775, 2020 - arxiv.org
Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable
from natural speech. However, the prosody of generated utterances often represents the …

DelightfulTTS: The Microsoft speech synthesis system for Blizzard Challenge 2021

Y Liu, Z Xu, G Wang, K Chen, B Li, X Tan, J Li… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper describes the Microsoft end-to-end neural text to speech (TTS) system:
DelightfulTTS for Blizzard Challenge 2021. The goal of this challenge is to synthesize …

A review of deep learning based speech synthesis

Y Ning, S He, Z Wu, C Xing, LJ Zhang - Applied Sciences, 2019 - mdpi.com
Speech synthesis, also known as text-to-speech (TTS), has attracted increasingly more
attention. Recent advances on speech synthesis are overwhelmingly contributed by deep …

Deep Voice 3: Scaling text-to-speech with convolutional sequence learning

W Ping, K Peng, A Gibiansky, SO Arik… - arXiv preprint arXiv …, 2017 - arxiv.org
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS)
system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in …