A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality

X Tan, J Chen, H Liu, J Cong, C Zhang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. Some questions naturally arise that whether a TTS system can achieve human-level …

AdaSpeech 4: Adaptive text to speech in zero-shot scenarios

Y Wu, X Tan, B Li, L He, S Zhao, R Song, T Qin… - arXiv preprint arXiv …, 2022 - arxiv.org
Adaptive text to speech (TTS) can synthesize new voices in zero-shot scenarios efficiently,
by using a well-trained source TTS model without adapting it on the speech data of new …

PromptStyle: Controllable style transfer for text-to-speech with natural language descriptions

G Liu, Y Zhang, Y Lei, Y Chen, R Wang, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Style transfer TTS has shown impressive performance in recent years. However, style
control is often restricted to systems built on expressive speech recordings with discrete style …

Generation and detection of manipulated multimodal audiovisual content: Advances, trends and open challenges

H Liz-Lopez, M Keita, A Taleb-Ahmed, A Hadid… - Information …, 2024 - Elsevier
Generative deep learning techniques have invaded the public discourse recently. Despite
the advantages, the applications to disinformation are concerning as the counter-measures …

DelightfulTTS 2: End-to-end speech synthesis with adversarial vector-quantized auto-encoders

Y Liu, R Xue, L He, X Tan, S Zhao - arXiv preprint arXiv:2207.04646, 2022 - arxiv.org
Current text to speech (TTS) systems usually leverage a cascaded acoustic model and
vocoder pipeline with mel-spectrograms as the intermediate representations, which suffer …

Low-resource multilingual and zero-shot multispeaker TTS

F Lux, J Koch, NT Vu - arXiv preprint arXiv:2210.12223, 2022 - arxiv.org
While neural methods for text-to-speech (TTS) have shown great advances in modeling
multiple speakers, even in zero-shot settings, the amount of data needed for those …

MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS

H Guo, F Xie, X Wu, FK Soong… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
This article aims to improve neural TTS with vector-quantized, compact speech
representations. We propose a Vector-Quantized Variational AutoEncoder (VQ-VAE) based …

DiCLET-TTS: Diffusion model based cross-lingual emotion transfer for text-to-speech—A study between English and Mandarin

T Li, C Hu, J Cong, X Zhu, J Li, Q Tian… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
While the performance of cross-lingual TTS based on monolingual corpora has been
significantly improved recently, generating cross-lingual speech still suffers from the foreign …