Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However …
Language models (LMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately …
Zero-shot text-to-speech aims at synthesizing voices with unseen speech prompts. Previous large-scale multispeaker TTS models have successfully achieved this goal with an enrolled …
Generative adversarial networks (GANs) have rapidly emerged as powerful tools for generating realistic and diverse data across various domains, including computer vision and …
G Chen, Y Wu, S Liu, T Liu, X Du, F Wei - arXiv preprint arXiv:2308.12770, 2023 - arxiv.org
Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism. Alongside its …
Recently, there has been a growing interest in the field of controllable Text-to-Speech (TTS). While previous studies have relied on users providing specific style factor values based on …
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state- of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …
The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research …