Deep representation learning in speech processing: Challenges, recent advances, and future trends

S Latif, R Rana, S Khalifa, R Jurdak, J Qadir… - arXiv preprint arXiv …, 2020 - arxiv.org
Research on speech processing has traditionally considered the task of designing
hand-engineered acoustic features (feature engineering) as a separate distinct problem from the …

Audio deepfake detection: A survey

J Yi, C Wang, J Tao, X Zhang, CY Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio deepfake detection is an emerging active topic. A growing number of literatures have
aimed to study deepfake detection algorithms and achieved effective performance, the …

wav2vec 2.0: A framework for self-supervised learning of speech representations

A Baevski, Y Zhou, A Mohamed… - Advances in neural …, 2020 - proceedings.neurips.cc
We show for the first time that learning powerful representations from speech audio alone
followed by fine-tuning on transcribed speech can outperform the best semi-supervised …

TERA: Self-supervised learning of transformer encoder representation for speech

AT Liu, SW Li, H Lee - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
We introduce a self-supervised speech pre-training method called TERA, which stands for
Transformer Encoder Representations from Alteration. Recent approaches often learn by …

vq-wav2vec: Self-supervised learning of discrete speech representations

A Baevski, S Schneider, M Auli - arXiv preprint arXiv:1910.05453, 2019 - arxiv.org
We propose vq-wav2vec to learn discrete representations of audio segments through a
wav2vec-style self-supervised context prediction task. The algorithm uses either a Gumbel …

wav2vec: Unsupervised pre-training for speech recognition

S Schneider, A Baevski, R Collobert, M Auli - arXiv preprint arXiv …, 2019 - arxiv.org
We explore unsupervised pre-training for speech recognition by learning representations of
raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting …

Deep audio-visual speech recognition

T Afouras, JS Chung, A Senior… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …

Speaker recognition from raw waveform with SincNet

M Ravanelli, Y Bengio - 2018 IEEE spoken language …, 2018 - ieeexplore.ieee.org
Deep learning is progressively gaining popularity as a viable alternative to i-vectors for
speaker recognition. Promising results have been recently obtained with Convolutional …

LEAF: A learnable frontend for audio classification

N Zeghidour, O Teboul, FDC Quitry… - arXiv preprint arXiv …, 2021 - arxiv.org
Mel-filterbanks are fixed, engineered audio features which emulate human perception and
have been used through the history of audio understanding up to today. However, their …

Interpretable convolutional filters with SincNet

M Ravanelli, Y Bengio - arXiv preprint arXiv:1811.09725, 2018 - arxiv.org
Deep learning is currently playing a crucial role toward higher levels of artificial intelligence.
This paradigm allows neural networks to learn complex and abstract representations, that …