A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

High fidelity neural audio compression

A Défossez, J Copet, G Synnaeve, Y Adi - arXiv preprint arXiv:2210.13438, 2022 - arxiv.org
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural
networks. It consists in a streaming encoder-decoder architecture with quantized latent …

The age of synthetic realities: Challenges and opportunities

JP Cardenuto, J Yang, R Padilha… - … on Signal and …, 2023 - nowpublishers.com
Synthetic realities are digital creations or augmentations that are contextually generated
through the use of Artificial Intelligence (AI) methods, leveraging extensive amounts of data …

Speech enhancement and dereverberation with diffusion-based generative models

J Richter, S Welker, JM Lemercier… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
In this work, we build upon our previous publication and use diffusion-based generative
models for speech enhancement. We present a detailed overview of the diffusion process …

A survey on audio diffusion models: Text to speech synthesis and enhancement in generative AI

C Zhang, C Zhang, S Zheng, M Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative AI has demonstrated impressive performance in various fields, among which
speech synthesis is an interesting direction. With the diffusion model as the most popular …

StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation

JM Lemercier, J Richter, S Welker… - … /ACM Transactions on …, 2023 - ieeexplore.ieee.org
Diffusion models have shown a great ability at bridging the performance gap between
predictive and generative approaches for speech enhancement. We have shown that they …

HierSpeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis

SH Lee, SB Kim, JH Lee, E Song… - Advances in Neural …, 2022 - proceedings.neurips.cc
This paper presents HierSpeech, a high-quality end-to-end text-to-speech (TTS) system
based on a hierarchical conditional variational autoencoder (VAE) utilizing self-supervised …

CoreDiff: Contextual error-modulated generalized diffusion model for low-dose CT denoising and generalization

Q Gao, Z Li, J Zhang, Y Zhang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Low-dose computed tomography (CT) images suffer from noise and artifacts due to photon
starvation and electronic noise. Recently, some works have attempted to use diffusion …

Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration

JM Lemercier, J Richter, S Welker… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Diffusion-based generative models have had a high impact on the computer vision and
speech processing communities these past years. Besides data generation tasks, they have …

APNet: An all-frame-level neural vocoder incorporating direct prediction of amplitude and phase spectra

Y Ai, ZH Ling - IEEE/ACM Transactions on Audio, Speech, and …, 2023 - ieeexplore.ieee.org
This paper presents a novel neural vocoder named APNet which reconstructs speech
waveforms from acoustic features by predicting amplitude and phase spectra directly. The …