Despite surveillance systems becoming increasingly ubiquitous in our living environment, automated surveillance, currently based on video sensory modality and machine …
A Owens, AA Efros - Proceedings of the European …, 2018 - openaccess.thecvf.com
The thud of a bouncing ball, the onset of speech as lips open--when visual and audio events occur together, it suggests that there might be a common, underlying event that produced …
R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org
We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds …
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a …
C Chen, U Jain, C Schissler, SVA Gari… - Computer Vision–ECCV …, 2020 - Springer
Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf—restricted to solely their visual perception of the environment. We introduce …
Y Tian, J Shi, B Li, Z Duan, C Xu - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and …
S Mo, P Morgado - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Since collecting ground-truth annotations of sounding …
C Gan, D Huang, H Zhao… - Proceedings of the …, 2020 - openaccess.thecvf.com
Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance and optical …
Visual events are usually accompanied by sounds in our daily lives. We pose the question: Can the machine learn the correspondence between visual scene and the sound, and …