A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Guided-TTS: A diffusion model for text-to-speech via classifier guidance

H Kim, S Kim, S Yoon - International Conference on …, 2022 - proceedings.mlr.press
We propose Guided-TTS, a high-quality text-to-speech (TTS) model that does not require
any transcript of the target speaker using classifier guidance. Guided-TTS combines an …

Seamless: Multilingual Expressive and Streaming Speech Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale automatic speech translation systems today lack key features that help
machine-mediated communication feel seamless when compared to human-to-human dialogue. In …

Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio is an essential part of our life, but creating it often requires expertise and is
time-consuming. Research communities have made great progress over the past year advancing …

One TTS alignment to rule them all

R Badlani, A Łańcucki, KJ Shih, R Valle… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models.
Autoregressive TTS models typically use an attention mechanism to learn these alignments …

JETS: Jointly training FastSpeech2 and HiFi-GAN for end to end text to speech

D Lim, S Jung, E Kim - arXiv preprint arXiv:2203.16852, 2022 - arxiv.org
In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models,
have shown synthesis quality close to human speech. For example, FastSpeech2 transforms …

WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

H Siuzdak, P Dura, P van Rijn, N Jacoby - arXiv preprint arXiv:2203.16930, 2022 - arxiv.org
Recent advances in neural text-to-speech research have been dominated by two-stage
pipelines utilizing low-level intermediate speech representations such as mel-spectrograms …

Language-agnostic meta-learning for low-resource text-to-speech with articulatory features

F Lux, NT Vu - arXiv preprint arXiv:2203.03191, 2022 - arxiv.org
While neural text-to-speech systems perform remarkably well in high-resource scenarios,
they cannot be applied to the majority of the over 6,000 spoken languages in the world due …

Phone-to-audio alignment without text: A semi-supervised approach

J Zhu, C Zhang, D Jurgens - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org
The task of phone-to-audio alignment has many applications in speech research. Here we
introduce two Wav2Vec2-based models for both text-dependent and text-independent …

DailyTalk: Spoken dialogue dataset for conversational text-to-speech

K Lee, K Park, D Kim - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
The majority of current Text-to-Speech (TTS) datasets, which are collections of individual
utterances, contain few conversational aspects. In this paper, we introduce DailyTalk, a high …