NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset

Z Cai, S Ghosh, AP Adatia, M Hayat, A Dhall… - arXiv preprint arXiv …, 2023 - arxiv.org
The detection and localization of highly realistic deepfake audio-visual content are
challenging even for the most advanced state-of-the-art methods. While most of the research …

HierSpeech++: Bridging the Gap Between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-Shot Speech Synthesis

SH Lee, HY Choi, SB Kim, SW Lee - arXiv preprint arXiv:2311.12454, 2023 - arxiv.org
Large language model (LLM)-based speech synthesis has been widely adopted in zero-
shot speech synthesis. However, such models require large-scale data and possess the same …

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

S Ji, Z Jiang, H Wang, J Zuo, Z Zhao - arXiv preprint arXiv:2402.09378, 2024 - arxiv.org
Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice
cloning capabilities, requiring only a few seconds of unseen speaker voice prompts …

1M-Deepfakes Detection Challenge

Z Cai, A Dhall, S Ghosh, M Hayat, D Kollias… - arXiv preprint arXiv …, 2024 - arxiv.org
The detection and localization of deepfake content, particularly when small fake segments
are seamlessly mixed with real videos, remains a significant challenge in the field of digital …

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

N Kanda, X Wang, SE Eskimez, M Thakker… - arXiv preprint arXiv …, 2024 - arxiv.org
Laughter is one of the most expressive and natural aspects of human speech, conveying
emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the …

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

H Guo, F Xie, K Xie, D Yang, D Guo, X Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Long speech sequences have been troubling language model (LM)-based TTS
approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a …

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

S Ji, Z Jiang, X Cheng, Y Chen, M Fang, J Zuo… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models have been effectively applied to modeling natural signals, such as
images, video, speech, and audio. A crucial component of these models is the codec …

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

S Chen, S Liu, L Zhou, Y Liu, X Tan, J Li, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …