Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Unsupervised cross-lingual representation learning for speech recognition

A Conneau, A Baevski, R Collobert… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents XLSR, which learns cross-lingual speech representations by pretraining
a single model from the raw waveform of speech in multiple languages. We build on …

On generative spoken language modeling from raw audio

K Lakhotia, E Kharitonov, WN Hsu, Y Adi… - Transactions of the …, 2021 - direct.mit.edu
We introduce Generative Spoken Language Modeling, the task of learning the
acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and …

Speech resynthesis from discrete disentangled self-supervised representations

A Polyak, Y Adi, J Copet, E Kharitonov… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose using self-supervised discrete representations for the task of speech
resynthesis. To generate disentangled representation, we separately extract low-bitrate …

Toward understanding the communication in sperm whales

J Andreas, G Beguš, MM Bronstein, R Diamant… - iScience, 2022 - cell.com
Machine learning has been advancing dramatically over the past decade. Most strides are
human-based applications due to the availability of large-scale datasets; however …

SLAM: A unified encoder for speech and language modeling via speech-text joint pre-training

A Bapna, Y Chung, N Wu, A Gulati, Y Jia… - arXiv preprint arXiv …, 2021 - arxiv.org
Unsupervised pre-training is now the predominant approach for both text and speech
understanding. Self-attention models pre-trained on large amounts of unannotated data …

Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge

B Van Niekerk, L Nortje, H Kamper - arXiv preprint arXiv:2005.09409, 2020 - arxiv.org
In this paper, we explore vector quantization for acoustic unit discovery. Leveraging
unlabelled data, we aim to learn discrete representations of speech that separate phonetic …

SQ-VAE: Variational Bayes on discrete representation with self-annealed stochastic quantization

Y Takida, T Shibuya, WH Liao, CH Lai… - arXiv preprint arXiv …, 2022 - arxiv.org
One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned
discrete representation uses only a fraction of the full capacity of the codebook, also known …

The zero resource speech challenge 2019: TTS without T

E Dunbar, R Algayres, J Karadayi, M Bernard… - arXiv preprint arXiv …, 2019 - arxiv.org
We present the Zero Resource Speech Challenge 2019, which proposes to build a speech
synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …