NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

SpeechX: Neural codec language model as a versatile speech transformer

X Wang, M Thakker, Z Chen, N Kanda… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …

UniAudio: An audio foundation model toward universal audio generation

D Yang, J Tian, X Tan, R Huang, S Liu, X Chang… - arXiv preprint arXiv …, 2023 - arxiv.org
Language models (LMs) have demonstrated the capability to handle a variety of generative
tasks. This paper presents the UniAudio system, which, unlike prior task-specific …

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Mega-TTS 2: Zero-shot text-to-speech with arbitrary length speech prompts

Z Jiang, J Liu, Y Ren, J He, C Zhang, Z Ye… - arXiv preprint arXiv …, 2023 - arxiv.org
Zero-shot text-to-speech aims at synthesizing voices with unseen speech prompts. Previous
large-scale multispeaker TTS models have successfully achieved this goal with an enrolled …

Ten years of generative adversarial nets (GANs): a survey of the state-of-the-art

T Chakraborty, UR KS, SM Naik, M Panja… - Machine Learning …, 2024 - iopscience.iop.org
Generative adversarial networks (GANs) have rapidly emerged as powerful tools for
generating realistic and diverse data across various domains, including computer vision and …

WavMark: Watermarking for audio generation

G Chen, Y Wu, S Liu, T Liu, X Du, F Wei - arXiv preprint arXiv:2308.12770, 2023 - arxiv.org
Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice
using just a few seconds of recording while maintaining a high level of realism. Alongside its …

TextrolSpeech: A text style control speech corpus with codec language text-to-speech models

S Ji, J Zuo, M Fang, Z Jiang, F Chen… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Recently, there has been a growing interest in the field of controllable Text-to-Speech (TTS).
While previous studies have relied on users providing specific style factor values based on …

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

P Peng, PY Huang, D Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VoiceCraft, a token infilling neural codec language model, that achieves
state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …

AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset

Z Cai, S Ghosh, AP Adatia, M Hayat, A Dhall… - arXiv preprint arXiv …, 2023 - arxiv.org
The detection and localization of highly realistic deepfake audio-visual content are
challenging even for the most advanced state-of-the-art methods. While most of the research …