Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

AVSegFormer: Audio-visual segmentation with transformer

S Gao, Z Chen, G Chen, W Wang, T Lu - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Audio-visual segmentation (AVS) aims to locate and segment the sounding objects in a
given video, which demands audio-driven pixel-level scene understanding. The existing …

Multimodal variational auto-encoder based audio-visual segmentation

Y Mao, J Zhang, M Xiang… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …

CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation

K Li, Z Yang, L Chen, Y Yang, J Xiao - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-
producing objects within image frames and ensure the maps faithfully adhere to the given …

Sound source localization is all about cross-modal alignment

A Senocak, H Ryu, J Kim, TH Oh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …

Learning audio-visual source localization via false negative aware contrastive learning

W Sun, J Zhang, J Wang, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervised audio-visual source localization aims to locate sound-source objects in
video frames without extra annotations. Recent methods often approach this goal with the …

Contrastive positive sample propagation along the audio-visual event line

J Zhou, D Guo, M Wang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …

EPIC-SOUNDS: A large-scale dataset of actions that sound

J Huh, J Chalk, E Kazakos, D Damen… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal
extents and class labels within the audio stream of the egocentric videos from EPIC …

Achieving cross modal generalization with multimodal unified representation

Y Xia, H Huang, J Zhu, Z Zhao - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper introduces a novel task called Cross Modal Generalization (CMG), which
addresses the challenge of learning a unified discrete representation from paired …