Empowering things with intelligence: a survey of the progress, challenges, and opportunities in artificial intelligence of things

J Zhang, D Tao - IEEE Internet of Things Journal, 2020 - ieeexplore.ieee.org
In the Internet-of-Things (IoT) era, billions of sensors and devices collect and process data
from the environment, transmit them to cloud centers, and receive feedback via the Internet …

Lessons from infant learning for unsupervised machine learning

L Zaadnoordijk, TR Besold, R Cusack - Nature Machine Intelligence, 2022 - nature.com
The desire to reduce the dependence on curated, labeled datasets and to leverage the vast
quantities of unlabeled data has triggered renewed interest in unsupervised (or self …

Audio–visual segmentation

J Zhou, J Wang, J Zhang, W Sun, J Zhang… - … on Computer Vision, 2022 - Springer
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Not only look, but also listen: Learning multimodal violence detection under weak supervision

P Wu, J Liu, Y Shi, Y Sun, F Shao, Z Wu… - Computer Vision–ECCV …, 2020 - Springer
Violence detection has been studied in computer vision for years. However, previous work
is either superficial, e.g., classification of short clips, and the single scenario, or …

Learning to answer questions in dynamic audio-visual scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Localizing visual sounds the hard way

H Chen, W Xie, T Afouras, A Nagrani… - Proceedings of the …, 2021 - openaccess.thecvf.com
The objective of this work is to localize sound sources that are visible in a video without
using manual annotations. Our key technical contribution is to show that, by training the …

Epic-fusion: Audio-visual temporal binding for egocentric action recognition

E Kazakos, A Nagrani, A Zisserman… - Proceedings of the …, 2019 - openaccess.thecvf.com
We focus on multi-modal fusion for egocentric action recognition, and propose a novel
architecture for multi-modal temporal-binding, i.e., the combination of modalities within a …

Audio-visual scene analysis with self-supervised multisensory features

A Owens, AA Efros - Proceedings of the European …, 2018 - openaccess.thecvf.com
The thud of a bouncing ball, the onset of speech as lips open: when visual and audio events
occur together, it suggests that there might be a common, underlying event that produced …