The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex …
This study introduces the first single-branch network designed to tackle a spectrum of biometric matching scenarios, including unimodal, multimodal, cross-modal, and missing …
WC Wang, S De Coninck, S Leroux… - Frontiers in Robotics and …, 2025 - frontiersin.org
Smart cities deploy various sensors, such as microphones and RGB cameras, to collect data that improves the safety and comfort of citizens. As data annotation is expensive, self …
Audio-visual pre-trained models have recently gained substantial attention and demonstrated superior performance on various audio-visual tasks. This study investigates …
Self-supervised learning (SSL), an unsupervised representation learning technique, has received widespread attention across various modalities. Speech, with its inherent …
Thanks to the rise of deep learning and the availability of large-scale audio-visual databases, recent advances have been achieved in Visual Speech Recognition (VSR) …
P Xiang, C Lin, K Wu, O Bai - arXiv preprint arXiv:2404.18327, 2024 - arxiv.org
This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion …
Y Wang, W Guo, R Huang, J Huang, Z Wang… - The Thirty-eighth Annual … - openreview.net
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, yet it remains challenging to build V2A models with high generation quality …