A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

AdaSpeech: Adaptive text to speech for custom voice

M Chen, X Tan, B Li, Y Liu, T Qin, S Zhao… - arXiv preprint arXiv …, 2021 - arxiv.org
Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims
to adapt a source TTS model to synthesize personal voice for a target speaker using few …

Review of end-to-end speech synthesis technology based on deep learning

Z Mu, X Yang, Y Dong - arXiv preprint arXiv:2104.09995, 2021 - arxiv.org
As an indispensable part of modern human-computer interaction system, speech synthesis
technology helps users get the output of intelligent machine more easily and intuitively, thus …

Non-attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling

J Shen, Y Jia, M Chrzanowski, Y Zhang, I Elias… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model,
replacing the attention mechanism with an explicit duration predictor. This improves …

GANSpeech: Adversarial training for high-fidelity multi-speaker speech synthesis

J Yang, JS Bae, T Bak, Y Kim, HY Cho - arXiv preprint arXiv:2106.15153, 2021 - arxiv.org
Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the
generation of reasonably good speech quality with a single model and made it possible to …

Accented text-to-speech synthesis with limited data

X Zhou, M Zhang, Y Zhou, Z Wu… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
This paper presents an accented text-to-speech (TTS) synthesis framework with limited
training data. We study two aspects concerning accent rendering: phonetic (phoneme …

EMOVIE: A Mandarin emotion speech dataset with a simple emotional text-to-speech model

C Cui, Y Ren, J Liu, F Chen, R Huang, M Lei… - arXiv preprint arXiv …, 2021 - arxiv.org
Recently, there has been an increasing interest in neural speech synthesis. While the deep
neural network achieves the state-of-the-art result in text-to-speech (TTS) tasks, how to …

Non-autoregressive TTS with explicit duration modelling for low-resource highly expressive speech

R Shah, K Pokora, A Ezzerg, V Klimkov… - arXiv preprint arXiv …, 2021 - arxiv.org
Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they
typically require a large amount of recordings from the target speaker. In previous work, a 3 …

Residual adapters for few-shot text-to-speech speaker adaptation

N Morioka, H Zen, N Chen, Y Zhang, Y Ding - arXiv preprint arXiv …, 2022 - arxiv.org
Adapting a neural text-to-speech (TTS) model to a target speaker typically involves fine-
tuning most if not all of the parameters of a pretrained multi-speaker backbone model …

Voice filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

A Gabryś, G Huybrechts, MS Ribeiro… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data
to generate high-quality synthetic speech. When using reduced amounts of training data …