Survey on automatic lip-reading in the era of deep learning

A Fernandez-Lopez, FM Sukno - Image and Vision Computing, 2018 - Elsevier
In the last few years, there has been an increasing interest in developing systems for
Automatic Lip-Reading (ALR). Similarly to other computer vision applications, methods …

Audio-visual speech and gesture recognition by sensors of mobile devices

D Ryumin, D Ivanko, E Ryumina - Sensors, 2023 - mdpi.com
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable
speech recognition, particularly when audio is corrupted by noise. Additional visual …

LipNet: End-to-end sentence-level lipreading

YM Assael, B Shillingford, S Whiteson… - arXiv preprint arXiv …, 2016 - arxiv.org
Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional
approaches separated the problem into two stages: designing or learning visual features …

End-to-end audiovisual speech recognition

S Petridis, T Stafylakis, P Ma, F Cai… - … on acoustics, speech …, 2018 - ieeexplore.ieee.org
Several end-to-end deep learning approaches have been recently presented which extract
either audio or visual features from the input images or audio signals and perform speech …

A survey of research on lipreading technology

M Hao, M Mamut, N Yadikar, A Aysa, K Ubul - IEEE Access, 2020 - ieeexplore.ieee.org
Although automatic speech recognition (ASR) technology is mature, there are still some
unsolved problems, such as how to accurately identify what the speaker is saying in a noisy …

Large-scale visual speech recognition

B Shillingford, Y Assael, MW Hoffman, T Paine… - arXiv preprint arXiv …, 2018 - arxiv.org
This work presents a scalable solution to open-vocabulary visual speech recognition. To
achieve this, we constructed the largest existing visual speech recognition dataset …

Audio-visual speech recognition with a hybrid CTC/attention architecture

S Petridis, T Stafylakis, P Ma… - 2018 IEEE Spoken …, 2018 - ieeexplore.ieee.org
Recent works in speech recognition rely either on connectionist temporal classification
(CTC) or sequence-to-sequence models for character-level recognition. CTC assumes …

LipNet: Sentence-level lipreading

YM Assael, B Shillingford, S Whiteson… - arXiv preprint arXiv …, 2016 - innovators-guide.ch
Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional
approaches separated the problem into two stages: designing or learning visual features …

Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading

X Weng, K Kitani - arXiv preprint arXiv:1905.02540, 2019 - arxiv.org
We focus on the word-level visual lipreading, which requires recognizing the word being
spoken, given only the video but not the audio. State-of-the-art methods explore the use of …

End-to-end audiovisual fusion with LSTMs

S Petridis, Y Wang, Z Li, M Pantic - arXiv preprint arXiv:1709.04343, 2017 - arxiv.org
Several end-to-end deep learning approaches have been recently presented which
simultaneously extract visual features from the input images and perform visual speech …