Rep-ViG-Apple: A CNN-GCN Hybrid Model for Apple Detection in Complex Orchard Environments

B Han, Z Lu, J Zhang, R Almodfer, Z Wang, W Sun… - Agronomy, 2024 - mdpi.com
Accurately recognizing apples in complex environments is essential for automating apple
picking operations, particularly under challenging natural conditions such as cloudy, snowy …

Learning video temporal dynamics with cross-modal attention for robust audio-visual speech recognition

S Kim, K Jang, S Bae, H Kim… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Audio-visual speech recognition (AVSR) aims to transcribe human speech using both audio
and video modalities. In practical environments with noise-corrupted audio, the role of video …

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Y Dai, H Chen, J Du, R Wang, S Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed
to be sensitive to missing video frames, performing even worse than single-modality models …

DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion

J Wu, Y Zhang, M Zhang, C Zheng, X Zhang, L Xie… - Applied …, 2025 - Springer
Speech recognition is a major communication channel for human-machine interaction and has
seen outstanding breakthroughs. However, the practicality of single-modal speech recognition is …

MaLa-ASR: Multimedia-Assisted LLM-Based ASR

G Yang, Z Ma, F Yu, Z Gao, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
As more and more information-rich data like video become available, utilizing multi-modal
auxiliary information to enhance audio tasks has sparked widespread research interest. The …

GFRN-SEA: Global-Aware Feature Representation Network for Speech Emotion Analysis

L Pan, Q Wang - IEEE Access, 2024 - ieeexplore.ieee.org
With the rapid advancement of artificial intelligence and machine learning, speech emotion
recognition (SER) holds significant potential across various applications. Despite progress …

CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition

H Wang, X Wan, N Zheng, K Liu, H Zhou, G Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Code-switching automatic speech recognition (ASR) aims to accurately transcribe speech that
contains two or more languages. To better capture language-specific speech …

Multi-Model Dual-Transformer Network for Audio-Visual Speech Enhancement

FE Wahab, N Saleem, A Hussain, R Ullah… - 3rd COG-MHEAR …, 2024 - isca-archive.org
Visual features offer important cues that can be used in noisy backgrounds. Audio-visual
speech enhancement (AVSE) improves speech quality and intelligibility by combining audio …

Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

H Wang, P Guo, X Wan, H Zhou, L Xie - arXiv preprint arXiv:2404.05466, 2024 - arxiv.org
Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a
speaker's silent lip motion captured in video. Current mainstream lip-reading approaches …

Low-Voltage Biological Electric Shock Fault Diagnosis Based on the Attention Mechanism Fusion Parallel Convolutional Neural Network/Bidirectional Long …

M Lin, Y Luo, S Chen, Z Qiu, Z Dai - Mathematics, 2024 - mdpi.com
Electric shock protection is critical for ensuring power safety in low-voltage grids, and robust
fault diagnosis methods provide an essential foundation for the accurate operation of such …