Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Trusted multi-view classification with dynamic evidential fusion

Z Han, C Zhang, H Fu, JT Zhou - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Existing multi-view classification algorithms focus on promoting accuracy by exploiting
different views, typically integrating them into common representations for follow-up tasks …

Visual speech recognition for multiple languages in the wild

P Ma, S Petridis, M Pantic - Nature Machine Intelligence, 2022 - nature.com
Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …

End-to-end audio-visual speech recognition with conformers

P Ma, S Petridis, M Pantic - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a
Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end …

MAViL: Masked audio-video learners

PY Huang, V Sharma, H Xu, C Ryali… - Advances in …, 2024 - proceedings.neurips.cc
We present Masked Audio-Video Learners (MAViL) to learn audio-visual
representations with three complementary forms of self-supervision: (1) reconstructing …

Lipreading using temporal convolutional networks

B Martinez, P Ma, S Petridis… - ICASSP 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
Lip-reading has attracted a lot of research attention lately thanks to advances in deep
learning. The current state-of-the-art model for recognition of isolated words in-the-wild …

Sub-word level lip reading with visual attention

KR Prajwal, T Afouras… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The goal of this paper is to learn strong lip reading models that can recognise speech in
silent videos. Most prior works deal with the open-set visual speech recognition problem by …

Audiovisual slowfast networks for video recognition

F Xiao, YJ Lee, K Grauman, J Malik… - arXiv preprint arXiv …, 2020 - arxiv.org
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual
perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a …

LipNet: End-to-end sentence-level lipreading

YM Assael, B Shillingford, S Whiteson… - arXiv preprint arXiv …, 2016 - arxiv.org
Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional
approaches separated the problem into two stages: designing or learning visual features …