Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies …
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in achieving timbre and speech style generalization, particularly in zero-shot TTS. However …
R Xue, Y Liu, L He, X Tan, L Liu, E Lin… - arXiv preprint arXiv …, 2023 - arxiv.org
Neural text-to-speech (TTS) generally consists of a cascaded architecture with a separately optimized acoustic model and vocoder, or an end-to-end architecture with continuous mel …
Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image …
SH Lee, HY Choi, HS Oh, SW Lee - arXiv preprint arXiv:2307.16171, 2023 - arxiv.org
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present …
Y Wu, Y Yu, J Shi, T Qian, Q Jin - arXiv preprint arXiv:2308.02867, 2023 - arxiv.org
There has been a growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the …
As a type of biometric identification, speaker identification (SID) systems face various attacks. Spoofing attacks imitate target speakers' timbre, while adversarial attacks confuse …
The vulnerability of the speaker identity verification system to attacks using voice cloning was examined. The research project involved creating a model for verifying the speaker's …
T Zhu, X Wang, X Qin, M Li - 2022 Asia-Pacific Signal and …, 2022 - ieeexplore.ieee.org
Recent anti-spoofing systems focus on spoofing detection, where the task is only to determine whether the test audio is fake. However, few studies pay attention to …