Word discovery in visually grounded, self-supervised speech models

P Peng, D Harwath - arXiv preprint arXiv:2203.15081, 2022 - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show …

Phone-to-audio alignment without text: A semi-supervised approach

J Zhu, C Zhang, D Jurgens - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org
The task of phone-to-audio alignment has many applications in speech research. Here we
introduce two Wav2Vec2-based models for both text-dependent and text-independent …

Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge

E Dunbar, N Hamilakis… - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent progress in self-supervised or unsupervised machine learning has opened the
possibility of building a full speech processing system from raw audio without using any …

A brief overview of unsupervised neural speech representation learning

L Borgholt, JD Havtorn, J Edin, L Maaløe… - arXiv preprint arXiv …, 2022 - arxiv.org
Unsupervised representation learning for speech processing has matured greatly in the last
few years. Work in computer vision and natural language processing has paved the way, but …

What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - Transactions of the …, 2024 - direct.mit.edu
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …

Word segmentation on discovered phone units with dynamic programming and self-supervised scoring

H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …

What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - arXiv preprint arXiv:2307.00162, 2023 - arxiv.org
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
producing performance and data efficiency improvements for a variety of speech tasks …

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

S Cuervo, A Lancucki, R Marxer… - Advances in …, 2022 - proceedings.neurips.cc
The success of deep learning comes from its ability to capture the hierarchical structure of
data by learning high-level representations defined in terms of low-level ones. In this paper …

Efficient transformers with dynamic token pooling

P Nawrot, J Chorowski, A Łańcucki… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformers achieve unrivalled performance in modelling language, but remain inefficient
in terms of memory and time complexity. A possible remedy is to reduce the sequence …

On compressing sequences for self-supervised speech models

Y Meng, HJ Chen, J Shi, S Watanabe… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Compressing self-supervised models has become increasingly necessary, as self-supervised
models become larger. While previous approaches have primarily focused on …