Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias

Z Jiang, Y Ren, Z Ye, J Liu, C Zhang, Q Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in
achieving timbre and speech style generalization, particularly in zero-shot TTS. However …

FoundationTTS: Text-to-speech for ASR customization with generative language model

R Xue, Y Liu, L He, X Tan, L Liu, E Lin… - arXiv preprint arXiv …, 2023 - arxiv.org
Neural text-to-speech (TTS) generally consists of cascaded architecture with separately
optimized acoustic model and vocoder, or end-to-end architecture with continuous mel …

Regeneration learning: A learning paradigm for data generation

X Tan, T Qin, J Bian, TY Liu, Y Bengio - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Machine learning methods for conditional data generation usually build a mapping
from source conditional data X to target data Y. The target Y (eg, text, speech, music, image …

HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

SH Lee, HY Choi, HS Oh, SW Lee - arXiv preprint arXiv:2307.16171, 2023 - arxiv.org
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems
still lack the ability to transfer the voice style of a novel speaker. In this paper, we present …

A systematic exploration of joint-training for singing voice synthesis

Y Wu, Y Yu, J Shi, T Qian, Q Jin - arXiv preprint arXiv:2308.02867, 2023 - arxiv.org
There has been a growing interest in using end-to-end acoustic models for singing voice
synthesis (SVS). Typically, these models require an additional vocoder to transform the …

Timbre-reserved Adversarial Attack in Speaker Identification

Q Wang, J Yao, L Zhang, P Guo… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
As a type of biometric identification, speaker identification (SID) systems face various
attacks. Spoofing attacks imitate target speakers' timbre, while adversarial attacks confuse …

Comparison of the Ability of Neural Network Model and Humans to Detect a Cloned Voice

K Milewski, S Zaporowski, A Czyżewski - Electronics, 2023 - mdpi.com
The vulnerability of the speaker identity verification system to attacks using voice cloning
was examined. The research project assumed creating a model for verifying the speaker's …

Source tracing: detecting voice spoofing

T Zhu, X Wang, X Qin, M Li - 2022 Asia-Pacific Signal and …, 2022 - ieeexplore.ieee.org
Recent anti-spoofing systems focus on spoofing detection, where the task is only to
determine whether the test audio is fake. However, few studies pay attention to …