Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is …
Existing multi-view classification algorithms focus on improving accuracy by exploiting different views, typically integrating them into common representations for follow-up tasks …
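To make the "common representation" idea concrete, here is a minimal PyTorch sketch: each view gets its own encoder into a shared space, the per-view embeddings are averaged into one fused vector, and a classifier runs on that. The view dimensions, hidden width, mean-fusion choice, and class count are illustrative assumptions, not the method of any particular paper above.

```python
# Minimal sketch: fuse several views into one common representation for classification.
# All sizes and the mean-fusion strategy are assumed for illustration.
import torch
import torch.nn as nn

class MultiViewClassifier(nn.Module):
    def __init__(self, view_dims, hidden=128, num_classes=10):
        super().__init__()
        # One small encoder per view, all mapping into a shared space.
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in view_dims
        )
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, views):
        # views: list of (batch, view_dim_i) tensors, one per view.
        shared = torch.stack([enc(v) for enc, v in zip(self.encoders, views)])
        return self.head(shared.mean(dim=0))  # averaging yields the common representation

model = MultiViewClassifier(view_dims=[32, 64])
logits = model([torch.randn(4, 32), torch.randn(4, 64)])
print(logits.shape)  # torch.Size([4, 10])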
Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the …
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end …
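The hybrid CTC/Attention objective named here is commonly written as a weighted sum of a CTC loss on the encoder outputs and a cross-entropy loss on the attention decoder outputs. In the sketch below, the ResNet-18/Conformer encoder and the decoder are stubbed out with random tensors, and the mixing weight `lam` is an assumed value, so this shows only the shape of the loss, not the paper's exact configuration.

```python
# Hedged sketch of a hybrid CTC/Attention training objective:
# loss = lam * CTC(encoder outputs) + (1 - lam) * CE(decoder outputs).
import torch
import torch.nn as nn
import torch.nn.functional as F

T, N, V = 50, 2, 30            # encoder frames, batch size, vocab size (0 = CTC blank)
lam = 0.3                      # CTC weight, an assumed value

enc_out = torch.randn(T, N, V)            # stand-in for ResNet-18 + Conformer output
targets = torch.randint(1, V, (N, 12))    # dummy label sequences (no blank symbols)

# CTC branch: alignment-free loss over per-frame encoder outputs.
ctc = nn.CTCLoss(blank=0)
ctc_loss = ctc(enc_out.log_softmax(-1),
               targets,
               input_lengths=torch.full((N,), T, dtype=torch.long),
               target_lengths=torch.full((N,), 12, dtype=torch.long))

# Attention branch: stand-in decoder logits scored with cross-entropy.
dec_logits = torch.randn(N, 12, V)        # would come from an attention decoder
att_loss = F.cross_entropy(dec_logits.reshape(-1, V), targets.reshape(-1))

loss = lam * ctc_loss + (1 - lam) * att_loss
print(float(loss))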
We present Masked Audio-Video Learners (MAViL) to learn audio-visual representations with three complementary forms of self-supervision: (1) reconstructing …
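Of the three forms of self-supervision, the first (masked reconstruction) is the easiest to sketch: mask most input tokens, encode the corrupted sequence, and penalize reconstruction error on the masked positions only. The token shapes, the 75% mask ratio, and the plain-MLP encoder below are illustrative assumptions; MAViL's actual model operates on paired audio and video streams with contrastive terms not shown here.

```python
# Minimal sketch of masked-reconstruction self-supervision on a token sequence.
import torch
import torch.nn as nn

B, L, D = 4, 16, 64                      # batch, tokens per clip, token dim
tokens = torch.randn(B, L, D)            # stand-in for audio/video patch tokens

mask = torch.rand(B, L) < 0.75           # mask ~75% of the tokens
corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)

encoder = nn.Sequential(nn.Linear(D, 128), nn.GELU(), nn.Linear(128, D))
recon = encoder(corrupted)

# Reconstruction loss is computed on the masked positions only.
loss = ((recon - tokens)[mask] ** 2).mean()
print(float(loss))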
Lip-reading has attracted a lot of research attention lately thanks to advances in deep learning. The current state-of-the-art model for recognition of isolated words in-the-wild …
The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by …
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a …
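A hedged sketch of the Slow/Fast idea: the Slow pathway sees sparsely sampled frames with more channels, the Fast pathway sees every frame with fewer channels, and their pooled features are concatenated with an audio vector before classification. The strides, channel widths, and concatenation fusion below are assumptions for illustration, not the AVSlowFast architecture itself.

```python
# Minimal two-rate sketch: sparse-frame Slow pathway + full-rate Fast pathway,
# fused with audio features for classification.
import torch
import torch.nn as nn

video = torch.randn(2, 3, 32, 64, 64)     # (batch, C, T, H, W)
audio = torch.randn(2, 128)               # stand-in audio features

slow_in = video[:, :, ::8]                # Slow pathway: every 8th frame
fast_in = video                           # Fast pathway: full frame rate

# Slow gets more channels, Fast gets fewer but sees more frames.
slow_path = nn.Sequential(nn.Conv3d(3, 16, 3, padding=1),
                          nn.AdaptiveAvgPool3d(1), nn.Flatten())
fast_path = nn.Sequential(nn.Conv3d(3, 8, 3, padding=1),
                          nn.AdaptiveAvgPool3d(1), nn.Flatten())

fused = torch.cat([slow_path(slow_in), fast_path(fast_in), audio], dim=1)
logits = nn.Linear(fused.shape[1], 400)(fused)   # e.g. 400 action classes
print(logits.shape)  # torch.Size([2, 400])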
Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features …
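The two-stage decomposition described here can be sketched as a per-frame visual frontend followed by a sequence model over the resulting feature stream. The tiny CNN, GRU width, and 28-symbol character vocabulary below are illustrative assumptions, not a specific published pipeline.

```python
# Two-stage lipreading sketch: (1) per-frame visual features, (2) sequence model.
import torch
import torch.nn as nn

frames = torch.randn(1, 75, 1, 50, 100)   # (batch, T, C, H, W) mouth-region crops

# Stage 1: extract a feature vector from each frame independently.
frontend = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                         nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1),
                         nn.Flatten())
feats = torch.stack([frontend(frames[:, t]) for t in range(frames.shape[1])], dim=1)

# Stage 2: a recurrent model over the feature stream predicts characters.
gru = nn.GRU(input_size=8, hidden_size=32, batch_first=True)
out, _ = gru(feats)
char_logits = nn.Linear(32, 28)(out)      # 26 letters + space + blank, assumed
print(char_logits.shape)  # torch.Size([1, 75, 28])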