A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

SpeechX: Neural codec language model as a versatile speech transformer

X Wang, M Thakker, Z Chen, N Kanda… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …

Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias

Z Jiang, Y Ren, Z Ye, J Liu, C Zhang, Q Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in
achieving timbre and speech style generalization, particularly in zero-shot TTS. However …

P-Flow: a fast and data-efficient zero-shot TTS through speech prompting

S Kim, K Shih, JF Santos… - Advances in …, 2024 - proceedings.neurips.cc
While recent large-scale neural codec language models have shown significant
improvement in zero-shot TTS by training on thousands of hours of data, they suffer from …

Generative pre-training for speech with flow matching

AH Liu, M Le, A Vyas, B Shi, A Tjandra… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative models have gained more and more attention in recent years for their
remarkable success in tasks that required estimating and sampling data distribution to …

DiffVoice: Text-to-speech with latent diffusion

Z Liu, Y Guo, K Yu - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion.
We propose to first encode speech signals into a phoneme-rate latent representation with a …

CLAPSpeech: Learning prosody from text context with contrastive language-audio pre-training

Z Ye, R Huang, Y Ren, Z Jiang, J Liu, J He… - arXiv preprint arXiv …, 2023 - arxiv.org
Improving text representation has attracted much attention to achieve expressive text-to-
speech (TTS). However, existing works only implicitly learn the prosody with masked token …

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

P Peng, PY Huang, D Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-
of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …