NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Seamless: Multilingual Expressive and Streaming Speech Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …

Avocodo: Generative adversarial network for artifact-free vocoder

T Bak, J Lee, H Bae, J Yang, JS Bae… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Neural vocoders based on the generative adversarial neural network (GAN) have been
widely used due to their fast inference speed and lightweight networks while generating …

Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

H Barakat, O Turk, C Demiroglu - EURASIP Journal on Audio, Speech, and …, 2024 - Springer
Speech synthesis has made significant strides thanks to the transition from machine learning
to deep learning models. Contemporary text-to-speech (TTS) models possess the capability …

Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering

R Liu, B Sisman, G Gao, H Li - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a
variant of the standard version (L1), which is challenging as L2 is different from L1 in terms …

MSceneSpeech: A multi-scene speech dataset for expressive speech synthesis

Q Yang, J Zuo, Z Su, Z Jiang, M Li, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce an open source high-quality Mandarin TTS dataset MSceneSpeech (Multiple
Scene Speech Dataset), which is intended to provide resources for expressive speech …

Using a large language model to control speaking style for expressive TTS

AT Sigurgeirsson, S King - Dialogue, 2023 - isca-archive.org
Large generative language models have been used to solve various language-related
tasks. We explore whether such models can suggest appropriate prosody for expressive …

PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control

S Zhang, A Mehrish, Y Li, S Poria - arXiv preprint arXiv:2501.06276, 2025 - arxiv.org
Speech synthesis has significantly advanced from statistical methods to deep neural
network architectures, leading to various text-to-speech (TTS) models that closely mimic …

Controllable Speaking Styles Using A Large Language Model

A Sigurgeirsson, S King - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically-different
renditions of the same target text. Such models jointly learn a latent acoustic space during …

MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

T Bak, Y Eom, SJ Choi, YS Joo - arXiv preprint arXiv:2410.03192, 2024 - arxiv.org
Text-to-speech (TTS) systems that scale up the amount of training data have achieved
significant improvements in zero-shot speech synthesis. However, these systems have …