Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias

Z Jiang, Y Ren, Z Ye, J Liu, C Zhang, Q Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in
achieving timbre and speech style generalization, particularly in zero-shot TTS. However …

FoundationTTS: Text-to-speech for ASR customization with generative language model

R Xue, Y Liu, L He, X Tan, L Liu, E Lin… - arXiv preprint arXiv …, 2023 - arxiv.org
Neural text-to-speech (TTS) generally consists of cascaded architecture with separately
optimized acoustic model and vocoder, or end-to-end architecture with continuous mel …

Regeneration learning: A learning paradigm for data generation

X Tan, T Qin, J Bian, TY Liu, Y Bengio - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Machine learning methods for conditional data generation usually build a mapping
from source conditional data X to target data Y. The target Y (eg, text, speech, music, image …

HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

SH Lee, HY Choi, HS Oh, SW Lee - arXiv preprint arXiv:2307.16171, 2023 - arxiv.org
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems
still lack the ability to transfer the voice style of a novel speaker. In this paper, we present …

A systematic exploration of joint-training for singing voice synthesis

Y Wu, Y Yu, J Shi, T Qian, Q Jin - arXiv preprint arXiv:2308.02867, 2023 - arxiv.org
There has been a growing interest in using end-to-end acoustic models for singing voice
synthesis (SVS). Typically, these models require an additional vocoder to transform the …

Timbre-reserved Adversarial Attack in Speaker Identification

Q Wang, J Yao, L Zhang, P Guo… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
As a type of biometric identification, speaker identification (SID) systems face various
attacks. Spoofing attacks imitate target speakers' timbre, while adversarial attacks confuse …

Comparison of the Ability of Neural Network Model and Humans to Detect a Cloned Voice

K Milewski, S Zaporowski, A Czyżewski - Electronics, 2023 - mdpi.com
The vulnerability of the speaker identity verification system to attacks using voice cloning
was examined. The research project assumed creating a model for verifying the speaker's …

Source tracing: detecting voice spoofing

T Zhu, X Wang, X Qin, M Li - 2022 Asia-Pacific Signal and …, 2022 - ieeexplore.ieee.org
Recent anti-spoofing systems focus on spoofing detection, where the task is only to
determine whether the test audio is fake. However, few studies pay attention to …