Empowering things with intelligence: a survey of the progress, challenges, and opportunities in artificial intelligence of things

J Zhang, D Tao - IEEE Internet of Things Journal, 2020 - ieeexplore.ieee.org
In the Internet-of-Things (IoT) era, billions of sensors and devices collect and process data
from the environment, transmit them to cloud centers, and receive feedback via the Internet …

Lessons from infant learning for unsupervised machine learning

L Zaadnoordijk, TR Besold, R Cusack - Nature Machine Intelligence, 2022 - nature.com
The desire to reduce the dependence on curated, labeled datasets and to leverage the vast
quantities of unlabeled data has triggered renewed interest in unsupervised (or self …

Audio–visual segmentation

J Zhou, J Wang, J Zhang, W Sun, J Zhang… - … on Computer Vision, 2022 - Springer
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Not only look, but also listen: Learning multimodal violence detection under weak supervision

P Wu, J Liu, Y Shi, Y Sun, F Shao, Z Wu… - Computer Vision–ECCV …, 2020 - Springer
Violence detection has been studied in computer vision for years. However, previous work
is either superficial, e.g., classification of short clips, and the single scenario, or …

Learning to answer questions in dynamic audio-visual scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Localizing visual sounds the hard way

H Chen, W Xie, T Afouras, A Nagrani… - Proceedings of the …, 2021 - openaccess.thecvf.com
The objective of this work is to localize sound sources that are visible in a video without
using manual annotations. Our key technical contribution is to show that, by training the …

Epic-fusion: Audio-visual temporal binding for egocentric action recognition

E Kazakos, A Nagrani, A Zisserman… - Proceedings of the …, 2019 - openaccess.thecvf.com
We focus on multi-modal fusion for egocentric action recognition, and propose a novel
architecture for multi-modal temporal-binding, i.e., the combination of modalities within a …

Audio-visual scene analysis with self-supervised multisensory features

A Owens, AA Efros - Proceedings of the European …, 2018 - openaccess.thecvf.com
The thud of a bouncing ball, the onset of speech as lips open: when visual and audio events
occur together, it suggests that there might be a common, underlying event that produced …