HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

L Sun, Z Lian, B Liu, J Tao - Information Fusion, 2024 - Elsevier
Abstract: Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent machines. Previous …

On the Evaluation of Speech Foundation Models for Spoken Language Understanding

S Arora, A Pasad, CM Chien, J Han, R Sharma… - arXiv preprint arXiv …, 2024 - arxiv.org
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was
recently introduced to address the need for open resources and benchmarking of complex …

One Model to Rule Them All: A Universal Transformer for Biometric Matching

M Abdrakhmanova, A Yermekova, Y Barko… - IEEE …, 2024 - ieeexplore.ieee.org
This study introduces the first single-branch network designed to tackle a spectrum of
biometric matching scenarios, including unimodal, multimodal, cross-modal, and missing …

Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data

WC Wang, S De Coninck, S Leroux… - Frontiers in Robotics and …, 2025 - frontiersin.org
Smart cities deploy various sensors such as microphones and RGB cameras to collect data
to improve the safety and comfort of the citizens. As data annotation is expensive, self …

Measuring Sound Symbolism In Audio-Visual Models

WC Tseng, YJ Shih, D Harwath… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Audio-visual pre-trained models have gained substantial attention recently and
demonstrated superior performance on various audio-visual tasks. This study investigates …

Unveiling the linguistic capabilities of a self-supervised speech model through cross-lingual benchmark and layer-wise similarity analysis

T Ashihara, M Delcroix, Y Ijima, M Kashino - IEEE Access, 2024 - ieeexplore.ieee.org
Self-supervised learning (SSL), an unsupervised representation learning technique, has
received widespread attention across various modalities. Speech, with its inherent …

Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition

D Gimeno-Gómez, CD Martínez-Hinarejos - arXiv preprint arXiv …, 2024 - arxiv.org
Thanks to the rise of deep learning and the availability of large-scale audio-visual
databases, recent advances have been achieved in Visual Speech Recognition (VSR) …

MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition

P Xiang, C Lin, K Wu, O Bai - arXiv preprint arXiv:2404.18327, 2024 - arxiv.org
This paper presents a novel approach to processing multimodal data for dynamic emotion
recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion …

Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

Y Wang, W Guo, R Huang, J Huang, Z Wang… - The Thirty-eighth Annual … - openreview.net
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent
video, and it remains challenging to build V2A models with high generation quality …