A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Cited by 109 · Related articles · All 6 versions

[PDF] arxiv.org

Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

Cited by 390 · Related articles · All 3 versions

[PDF] neurips.cc

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc

Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

Cited by 123 · Related articles · All 8 versions

[PDF] arxiv.org

SpeechX: Neural codec language model as a versatile speech transformer

X Wang, M Thakker, Z Chen, N Kanda… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …

Cited by 41 · Related articles · All 2 versions

[PDF] arxiv.org

Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias

Z Jiang, Y Ren, Z Ye, J Liu, C Zhang, Q Yang… - arXiv preprint arXiv …, 2023 - arxiv.org

Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in
achieving timbre and speech style generalization, particularly in zero-shot TTS. However …

Cited by 38 · Related articles · All 2 versions

[PDF] neurips.cc

P-Flow: A fast and data-efficient zero-shot TTS through speech prompting

S Kim, K Shih, JF Santos… - Advances in …, 2024 - proceedings.neurips.cc

While recent large-scale neural codec language models have shown significant
improvement in zero-shot TTS by training on thousands of hours of data, they suffer from …

Cited by 5 · Related articles · All 3 versions

[PDF] arxiv.org

Generative pre-training for speech with flow matching

AH Liu, M Le, A Vyas, B Shi, A Tjandra… - arXiv preprint arXiv …, 2023 - arxiv.org

Generative models have gained more and more attention in recent years for their
remarkable success in tasks that required estimating and sampling data distribution to …

Cited by 12 · Related articles · All 3 versions

[PDF] arxiv.org

DiffVoice: Text-to-speech with latent diffusion

Z Liu, Y Guo, K Yu - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org

In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion.
We propose to first encode speech signals into a phoneme-rate latent representation with a …

Cited by 13 · Related articles · All 3 versions

[PDF] arxiv.org

CLAPSpeech: Learning prosody from text context with contrastive language-audio pre-training

Z Ye, R Huang, Y Ren, Z Jiang, J Liu, J He… - arXiv preprint arXiv …, 2023 - arxiv.org

Improving text representation has attracted much attention to achieve expressive text-to-
speech (TTS). However, existing works only implicitly learn the prosody with masked token …

Cited by 14 · Related articles · All 4 versions

[PDF] arxiv.org

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

P Peng, PY Huang, D Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-
of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …

Cited by 6 · Related articles · All 2 versions
