Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Cited by 145 · Related articles · All 3 versions

SpeechX: Neural codec language model as a versatile speech transformer

X Wang, M Thakker, Z Chen, N Kanda… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …

Cited by 48 · Related articles · All 2 versions

UniAudio: An audio foundation model toward universal audio generation

D Yang, J Tian, X Tan, R Huang, S Liu, X Chang… - arXiv preprint arXiv …, 2023 - arxiv.org

Language models (LMs) have demonstrated the capability to handle a variety of generative
tasks. This paper presents the UniAudio system, which, unlike prior task-specific …

Cited by 62 · Related articles · All 3 versions

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Cited by 52 · Related articles · All 4 versions

Mega-TTS 2: Zero-shot text-to-speech with arbitrary length speech prompts

Z Jiang, J Liu, Y Ren, J He, C Zhang, Z Ye… - arXiv preprint arXiv …, 2023 - arxiv.org

Zero-shot text-to-speech aims at synthesizing voices with unseen speech prompts. Previous
large-scale multispeaker TTS models have successfully achieved this goal with an enrolled …

Cited by 21 · Related articles · All 2 versions

Ten years of generative adversarial nets (GANs): a survey of the state-of-the-art

T Chakraborty, UR KS, SM Naik, M Panja… - Machine Learning …, 2024 - iopscience.iop.org

Generative adversarial networks (GANs) have rapidly emerged as powerful tools for
generating realistic and diverse data across various domains, including computer vision and …

Cited by 29 · Related articles · All 7 versions

WavMark: Watermarking for audio generation

G Chen, Y Wu, S Liu, T Liu, X Du, F Wei - arXiv preprint arXiv:2308.12770, 2023 - arxiv.org

Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice
using just a few seconds of recording while maintaining a high level of realism. Alongside its …

Cited by 17 · Related articles · All 2 versions

TextrolSpeech: A text style control speech corpus with codec language text-to-speech models

S Ji, J Zuo, M Fang, Z Jiang, F Chen… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org

Recently, there has been a growing interest in the field of controllable Text-to-Speech (TTS).
While previous studies have relied on users providing specific style factor values based on …

Cited by 19 · Related articles · All 3 versions

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

P Peng, PY Huang, D Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce VoiceCraft, a token infilling neural codec language model, that achieves
state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …

Cited by 17 · Related articles · All 2 versions

AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset

Z Cai, S Ghosh, AP Adatia, M Hayat, A Dhall… - arXiv preprint arXiv …, 2023 - arxiv.org

The detection and localization of highly realistic deepfake audio-visual content are
challenging even for the most advanced state-of-the-art methods. While most of the research …

Cited by 17 · Related articles · All 2 versions
