Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Audio–visual segmentation

J Zhou, J Wang, J Zhang, W Sun, J Zhang… - … on Computer Vision, 2022 - Springer
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …

Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Multimodal variational auto-encoder based audio-visual segmentation

Y Mao, J Zhang, M Xiang… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …

CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation

K Li, Z Yang, L Chen, Y Yang, J Xiao - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-
producing objects within image frames and ensure the maps faithfully adhere to the given …

Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception

J Gao, M Chen, C Xu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
With only video-level event labels, this paper targets the task of weakly-supervised audio-
visual event perception (WS-AVEP), which aims to temporally localize and categorize events …

Positive sample propagation along the audio-visual event line

J Zhou, L Zheng, Y Zhong, S Hao… - Proceedings of the …, 2021 - openaccess.thecvf.com
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …

Audio-visual generalised zero-shot learning with cross-modal attention and language

OB Mercea, L Riesch, A Koepke… - Proceedings of the …, 2022 - openaccess.thecvf.com
Learning to classify video data from classes not included in the training data, i.e., video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …

Contrastive positive sample propagation along the audio-visual event line

J Zhou, D Guo, M Wang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …

Cross-modal background suppression for audio-visual event localization

Y Xia, Z Zhao - Proceedings of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Audio-visual event (AVE) localization requires the model to jointly localize an event by
observing audio and visual information. However, in unconstrained videos, both information …