Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Self-supervised fine-tuning for improved content representations by speaker-invariant clustering

HJ Chang, AH Liu, J Glass - arXiv preprint arXiv:2305.11072, 2023 - arxiv.org
Self-supervised speech representation models have succeeded in various tasks, but
improving them for content-related problems using unlabeled data is challenging. We …

Domain adaptation with external off-policy acoustic catalogs for scalable contextual end-to-end automated speech recognition

DM Chan, S Ghosh, A Rastrow… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Despite improvements to the generalization performance of automated speech recognition
(ASR) models, specializing ASR models for downstream tasks remains a challenging task …

R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces

HJ Chang, J Glass - arXiv preprint arXiv:2311.09117, 2023 - arxiv.org
This paper introduces Robust Spin (R-Spin), a data-efficient self-supervised fine-tuning
framework for speaker and noise-invariant speech representations by learning discrete …

Representation Purification for End-to-End Speech Translation

C Zhang, Y Zhou, R Zhao, Y Chen, X Shi - arXiv preprint arXiv:2412.04266, 2024 - arxiv.org
Speech-to-text translation (ST) is a cross-modal task that involves converting spoken
language into text in a different language. Previous research primarily focused on …

CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation

X Zhao, H Sun, Y Lei, S Zhu… - Findings of the Association …, 2023 - aclanthology.org
Deep neural networks have demonstrated their capacity in extracting features from speech
inputs. However, these features may include non-linguistic speech factors such as timbre …

Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

DM Chan, S Ghosh, H Tulsiani… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
While word error rates of automatic speech recognition (ASR) systems have consistently
fallen, natural language understanding (NLU) applications built on top of ASR systems still …

Perturbation-invariant Speech Representation Learning by Online Clustering

HJ Chang - 2024 - dspace.mit.edu
Despite success across various tasks, self-supervised speech models face significant
challenges in enhancing content-related performance with unlabeled data, requiring …

Uncertainty and the Medical Interview: Towards Self-Assessment in Machine Learning Models

JD Havtorn - 2024 - orbit.dtu.dk
Natural language plays a key role in healthcare systems worldwide; yet, the medical
interview process has seen little development compared to the strides made in medical …

Understanding, Building, and Evaluating Models for Context Aware Conditional Natural Language Generation

DM Chan - 2024 - search.proquest.com
If you ask a human to describe an image, they might do so in a thousand different ways.
Each of these descriptions depends not only on the image but also on a rich tapestry of …