Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition

J Wang, Z Guo, C Yang, X Li… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Compared to feature or decision fusion, hybrid fusion can beneficially improve audio-visual
speech recognition accuracy. Existing works are mainly prone to design the multi-modality …

An End-to-End Mandarin Audio-Visual Speech Recognition Model with a Feature Enhancement Module

J Wang, C Yang, Z Guo, X Li… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Compared to relying only on audio information, incorporating visual information improves
speech recognition accuracy in noisy environments. Existing works are prone to design …

Multimodal sparse transformer network for audio-visual speech recognition

Q Song, B Sun, S Li - IEEE Transactions on Neural Networks …, 2022 - ieeexplore.ieee.org
Automatic speech recognition (ASR) is the major human–machine interface in many
intelligent systems, such as intelligent homes, autonomous driving, and servant robots …

Attention based multi modal learning for audio visual speech recognition

LA Kumar, DK Renuka, SL Rose… - 2022 4th …, 2022 - ieeexplore.ieee.org
In recent years, multimodal fusion using deep learning has proliferated in various tasks such
as emotion recognition, and speech recognition by drastically enhancing the performance of …

Improving audio-visual speech recognition by lip-subword correlation based visual pre-training and cross-modal fusion encoder

Y Dai, H Chen, J Du, X Ding, N Ding… - … on Multimedia and …, 2023 - ieeexplore.ieee.org
In recent research, slight performance improvement is observed from automatic speech
recognition systems to audio-visual speech recognition systems in end-to-end frameworks …

MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition

H Wang, P Guo, P Zhou, L Xie - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
While automatic speech recognition (ASR) systems degrade significantly in noisy
environments, audio-visual speech recognition (AVSR) systems aim to complement the …

Robust audio-visual speech recognition based on hybrid fusion

H Liu, W Li, B Yang - 2020 25th International Conference on …, 2021 - ieeexplore.ieee.org
The fusion of audio and visual modalities is an important stage of audio-visual speech
recognition (AVSR), which is generally approached through feature fusion or decision …

CATNet: Cross-modal fusion for audio–visual speech recognition

X Wang, J Mi, B Li, Y Zhao, J Meng - Pattern Recognition Letters, 2024 - Elsevier
Automatic speech recognition (ASR) is a typical pattern recognition technology that converts
human speeches into texts. With the aid of advanced deep learning models, the …

Robust audio-visual Mandarin speech recognition based on adaptive decision fusion and tone features

H Liu, Z Chen, W Shi - 2020 IEEE International Conference on …, 2020 - ieeexplore.ieee.org
Audio-visual speech recognition (AVSR) integrates both audio and visual information to
perform automatic speech recognition (ASR), which improves the robustness of human …

Modality attention for end-to-end audio-visual speech recognition

P Zhou, W Yang, W Chen, Y Wang… - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
Audio-visual speech recognition (AVSR) system is thought to be one of the most promising
solutions for robust speech recognition, especially in noisy environment. In this paper, we …