A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward

M Masood, M Nawaz, KM Malik, A Javed, A Irtaza… - Applied …, 2023 - Springer
Easy access to audio-visual content on social media, combined with the availability of
modern tools such as TensorFlow or Keras, and open-source trained models, along with …

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

Diffsound: Discrete diffusion model for text-to-sound generation

D Yang, J Yu, H Wang, W Wang… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Generating sound effects that people want is an important topic. However, there are only
limited studies in this area of sound generation. In this study, we investigate generating sound …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text-to-speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

VioLA: Conditional language models for speech recognition, synthesis, and translation

T Wang, L Zhou, Z Zhang, Y Wu, S Liu… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Recent research shows a big convergence in model architecture, training objectives, and
inference methods across various tasks for different modalities. In this paper, we propose …

DiffWave: A versatile diffusion model for audio synthesis

Z Kong, W Ping, J Huang, K Zhao… - arXiv preprint arXiv …, 2020 - arxiv.org
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional
and unconditional waveform generation. The model is non-autoregressive, and converts the …

WaveGrad: Estimating gradients for waveform generation

N Chen, Y Zhang, H Zen, RJ Weiss, M Norouzi… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper introduces WaveGrad, a conditional model for waveform generation which
estimates gradients of the data density. The model is built on prior work on score matching …
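Both DiffWave and WaveGrad generate audio by iteratively denoising Gaussian noise into a waveform. As an illustration only, the sketch below implements a generic DDPM-style reverse sampling loop in PyTorch; the tiny eps_model, the 50-step linear noise schedule, and the unconditional setup are placeholder assumptions for brevity, not either paper's architecture, conditioning, or exact sampler.

# Minimal, self-contained sketch of diffusion-based waveform sampling
# (illustrative only; hyperparameters and network are placeholders).
import torch

T = 50                                    # number of diffusion steps (the papers use more)
betas = torch.linspace(1e-4, 0.05, T)     # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Placeholder noise predictor. Real models are dilated-convolution networks
# that also condition on the diffusion step and on spectrogram features
# (mel-spectrograms in DiffWave, log-mel features in WaveGrad); this stub
# ignores both and is untrained.
eps_model = torch.nn.Sequential(
    torch.nn.Conv1d(1, 32, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv1d(32, 1, kernel_size=3, padding=1),
)

@torch.no_grad()
def sample(num_samples: int = 16000) -> torch.Tensor:
    """Iteratively denoise white noise into a waveform (ancestral sampling)."""
    x = torch.randn(1, 1, num_samples)            # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(x)                        # predict the injected noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise   # one reverse diffusion step
    return x.squeeze()

waveform = sample()
print(waveform.shape)   # torch.Size([16000])

The non-autoregressive character noted in the DiffWave snippet shows up here as the whole waveform being refined in parallel at every step, rather than sample by sample.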

NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality

X Tan, J Chen, H Liu, J Cong, C Zhang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. Some questions naturally arise as to whether a TTS system can achieve human-level …