Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Learning audio-visual speech representation by masked multimodal cluster prediction

B Shi, WN Hsu, K Lakhotia, A Mohamed - arXiv preprint arXiv:2201.02184, 2022 - arxiv.org
Video recordings of speech contain correlated audio and visual information, providing a
strong signal for speech representation learning from the speaker's lip movements and the …

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Q Zhu, L Zhou, Z Zhang, S Liu, B Jiao… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …

Multimodal conversational AI: A survey of datasets and approaches

A Sundar, L Heck - arXiv preprint arXiv:2205.06907, 2022 - arxiv.org
As humans, we experience the world with all our senses or modalities (sound, sight, touch,
smell, and taste). We use these modalities, particularly sight and touch, to convey and …

Domain adaptation with external off-policy acoustic catalogs for scalable contextual end-to-end automated speech recognition

DM Chan, S Ghosh, A Rastrow… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Despite improvements to the generalization performance of automated speech recognition
(ASR) models, specializing ASR models for downstream tasks remains challenging …

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

J Li, C Li, Y Wu, Y Qian - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the
accuracy and robustness of speech recognition systems with the assistance of visual cues in …

Content-context factorized representations for automated speech recognition

DM Chan, S Ghosh - arXiv preprint arXiv:2205.09872, 2022 - arxiv.org
Deep neural networks have largely demonstrated their ability to perform automated speech
recognition (ASR) by extracting meaningful features from input audio frames. Such features …

Audio-visual speech enhancement and separation by utilizing multi-modal self-supervised embeddings

IC Chern, KH Hung, YT Chen, T Hussain… - … , Speech, and Signal …, 2023 - ieeexplore.ieee.org
AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective
for categorical problems such as automatic speech recognition and lip-reading. This …

Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

Y Hu, R Li, C Chen, C Qin, Q Zhu, ES Chng - arXiv preprint arXiv …, 2023 - arxiv.org
Audio-visual speech recognition (AVSR) provides a promising solution to ameliorate the
noise-robustness of audio-only speech recognition with visual information. However, most …

Multi-Modal Signal Fusion: Enhancing Speech Recognition in Noisy Environments

C Veena, RJ Anandhi, A Singla, A Rana… - 2023 10th IEEE Uttar …, 2023 - ieeexplore.ieee.org
In the realm of automated speech recognition (ASR), the robustness of systems operating
within noisy environments remains a pivotal challenge. This paper introduces an innovative …