A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

BigVGAN: A universal neural vocoder with large-scale training

S Lee, W Ping, B Ginsburg, B Catanzaro… - arXiv preprint arXiv …, 2022 - arxiv.org
Despite recent progress in generative adversarial network (GAN)-based vocoders, where
the model generates raw waveform conditioned on acoustic features, it is challenging to …

Computer-assisted pronunciation training—Speech synthesis is almost all you need

D Korzekwa, J Lorenzo-Trueba, T Drugman… - Speech …, 2022 - Elsevier
The research community has long studied computer-assisted pronunciation training (CAPT)
methods in non-native speech. Researchers focused on studying various model …

MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS

H Guo, F Xie, X Wu, FK Soong… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
This article aims to improve neural TTS with vector-quantized, compact speech
representations. We propose a Vector-Quantized Variational AutoEncoder (VQ-VAE) based …

Cross-speaker style transfer for text-to-speech using data augmentation

MS Ribeiro, J Roth, G Comini… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data
augmentation via voice conversion. We assume to have a corpus of neutral non-expressive …

Non-autoregressive TTS with explicit duration modelling for low-resource highly expressive speech

R Shah, K Pokora, A Ezzerg, V Klimkov… - arXiv preprint arXiv …, 2021 - arxiv.org
Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they
typically require a large amount of recordings from the target speaker. In previous work, a 3 …

Creating new voices using normalizing flows

P Bilinski, T Merritt, A Ezzerg, K Pokora… - arXiv preprint arXiv …, 2023 - arxiv.org
Creating realistic and natural-sounding synthetic speech remains a big challenge for voice
identities unseen during training. As there is growing interest in synthesizing voices of new …

Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering

R Liu, B Sisman, G Gao, H Li - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a
variant of the standard version (L1), which is challenging as L2 is different from L1 in terms …

Text-free non-parallel many-to-many voice conversion using normalising flow

T Merritt, A Ezzerg, P Biliński… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Non-parallel voice conversion (VC) is typically achieved using lossy representations of the
source speech. However, ensuring only speaker identity information is dropped whilst all …