A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

AdaSpeech: Adaptive text to speech for custom voice

M Chen, X Tan, B Li, Y Liu, T Qin, S Zhao… - arXiv preprint arXiv …, 2021 - arxiv.org
Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims
to adapt a source TTS model to synthesize personal voice for a target speaker using few …

Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings

E Cooper, CI Lai, Y Yasuda, F Fang… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can
produce good speaker similarity for speakers seen during training, there remains a gap for …

Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding

S Choi, S Han, D Kim, S Ha - arXiv preprint arXiv:2005.08484, 2020 - arxiv.org
On account of growing demands for personalization, the need for a so-called few-shot TTS
system that clones speakers with only a few data is emerging. To address this issue, we …

Leveraging unpaired text data for training end-to-end speech-to-intent systems

Y Huang, HK Kuo, S Thomas, Z Kons… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly
extracts intents from speech requires large amounts of intent-labeled speech data, which is …

Audio Anti-Spoofing Detection: A Survey

M Li, Y Ahmadiadli, XP Zhang - arXiv preprint arXiv:2404.13914, 2024 - arxiv.org
The availability of smart devices leads to an exponential increase in multimedia content.
However, the rapid advancements in deep learning have given rise to sophisticated …

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Z Jiang, J Liu, Y Ren, J He, Z Ye, S Ji… - The Twelfth …, 2024 - openreview.net
Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts,
which significantly reduces the data and computation requirements for voice cloning by …

GANSpeech: Adversarial training for high-fidelity multi-speaker speech synthesis

J Yang, JS Bae, T Bak, Y Kim, HY Cho - arXiv preprint arXiv:2106.15153, 2021 - arxiv.org
Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the
generation of reasonably good speech quality with a single model and made it possible to …

nnSpeech: Speaker-guided conditional variational autoencoder for zero-shot multi-speaker text-to-speech

B Zhao, X Zhang, J Wang, N Cheng… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Multi-speaker text-to-speech (TTS) using a few adaption data is a challenge in practical
applications. To address that, we propose a zero-shot multi-speaker TTS, named nnSpeech …

Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

A Gabryś, G Huybrechts, MS Ribeiro… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data
to generate high-quality synthetic speech. When using reduced amounts of training data …