Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
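As a rough illustration of the frozen-backbone setup this entry describes (freezing a pretrained ViT and training only a lightweight task head), here is a minimal PyTorch sketch. The torchvision model, the 10-class head, and the dummy batch are assumptions for illustration, not the paper's actual pipeline.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Freeze a pretrained ViT; only the small task head is trained (linear-probe style).
backbone = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
backbone.heads = nn.Identity()          # expose the 768-d class-token representation
for p in backbone.parameters():
    p.requires_grad = False             # keep the ViT frozen
backbone.eval()

head = nn.Linear(768, 10)               # hypothetical 10-class downstream task
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(4, 3, 224, 224)    # dummy batch
labels = torch.randint(0, 10, (4,))
with torch.no_grad():                   # no gradients flow through the frozen backbone
    feats = backbone(images)
loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()
optimizer.step()
```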
S Gao, Z Chen, G Chen, W Wang, T Lu - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Audio-visual segmentation (AVS) aims to locate and segment the sounding objects in a given video, which demands audio-driven pixel-level scene understanding. The existing …
Y Mao, J Zhang, M Xiang… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure that the maps faithfully adhere to the given …
Humans can easily perceive the direction of sound sources in a visual scene, an ability termed sound source localization. Recent studies on learning-based sound source localization have …
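For context on the model family named here, the following is a generic multimodal-VAE skeleton: each modality is encoded to a Gaussian latent via the reparameterization trick, and a decoder conditioned on the latents would predict the segmentation map. This is an assumed, simplified illustration, not ECMVAE's actual architecture; all dimensions and names are hypothetical.

```python
import torch
import torch.nn as nn

# Generic per-modality Gaussian encoder with reparameterization (illustrative only).
class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return z, kl

audio_enc, visual_enc = ModalityEncoder(128), ModalityEncoder(768)
za, kl_a = audio_enc(torch.randn(2, 128))    # dummy audio features
zv, kl_v = visual_enc(torch.randn(2, 768))   # dummy visual features
# A decoder conditioned on [za, zv] would output per-pixel mask logits; training would
# combine a segmentation/reconstruction loss with the KL terms above.
```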
W Sun, J Zhang, J Wang, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with the …
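A common baseline for this self-supervised localization setting scores each spatial location of the visual feature map against a pooled audio embedding and trains with a contrastive objective; a minimal sketch follows. The feature shapes, the temperature of 0.07, and the max-pooling readout are assumptions typical of this line of work, not the cited paper's exact method.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 2, 512, 14, 14
visual_feats = torch.randn(B, C, H, W)       # spatial features of a frame (dummy)
audio_embed = torch.randn(B, C)              # pooled audio embedding of the clip (dummy)

v = F.normalize(visual_feats, dim=1)         # unit-norm features for cosine similarity
a = F.normalize(audio_embed, dim=1)
sim_map = torch.einsum('bchw,bc->bhw', v, a)  # per-pixel audio-visual similarity (B, H, W)

# InfoNCE-style objective: the matched audio-frame pair (diagonal) should score highest
# among all pairs; the upsampled sim_map serves as the localization heat map.
logits = torch.einsum('bchw,kc->bkhw', v, a).flatten(2).max(dim=2).values  # (B, B)
loss = F.cross_entropy(logits / 0.07, torch.arange(B))
```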
J Zhou, D Guo, M Wang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos from EPIC …
This paper introduces a novel task called Cross Modal Generalization (CMG), which addresses the challenge of learning a unified discrete representation from paired …
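One plausible ingredient of a unified discrete representation is a single vector-quantization codebook shared by both modalities, so that paired audio and visual features map to the same code indices. The sketch below shows that idea under stated assumptions (codebook size, feature dimensions, straight-through estimator); it is not the CMG paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A single codebook quantizing features from any modality (illustrative assumption).
class SharedCodebook(nn.Module):
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (B, T, dim) continuous tokens
        flat = z.reshape(-1, z.size(-1))                   # (B*T, dim)
        d = torch.cdist(flat, self.codes.weight)           # distance to every code
        idx = d.argmin(dim=-1).reshape(z.shape[:-1])       # nearest-code index per token
        zq = self.codes(idx)                               # quantized features
        zq = z + (zq - z).detach()                         # straight-through estimator
        commit = F.mse_loss(z, zq.detach())                # commitment term (codebook loss omitted)
        return zq, idx, commit

codebook = SharedCodebook()
audio_z = torch.randn(2, 10, 256)    # dummy paired audio tokens
video_z = torch.randn(2, 10, 256)    # dummy paired visual tokens
aq, a_idx, a_loss = codebook(audio_z)
vq, v_idx, v_loss = codebook(video_z)
# Encouraging a_idx and v_idx to agree on paired data is what would make the
# discrete code space modality-agnostic.
```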