SoundSpaces 2.0: A simulation platform for visual-acoustic learning

C Chen, C Schissler, S Garg… - Advances in …, 2022 - proceedings.neurips.cc
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio
rendering for 3D environments. Given a 3D mesh of a real-world environment …

Semantic audio-visual navigation

C Chen, Z Al-Halah, K Grauman - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Recent work on audio-visual navigation assumes a constantly-sounding target and restricts
the role of audio to signaling the target's position. We introduce semantic audio-visual …

Toward practical monocular indoor depth estimation

CY Wu, J Wang, M Hall… - Proceedings of the …, 2022 - openaccess.thecvf.com
The majority of prior monocular depth estimation methods without groundtruth depth
guidance focus on driving scenarios. We show that such methods generalize poorly to …

Few-shot audio-visual learning of environment acoustics

S Majumder, C Chen, Z Al-Halah… - Advances in Neural …, 2022 - proceedings.neurips.cc
Room impulse response (RIR) functions capture how the surrounding physical environment
transforms the sounds heard by a listener, with implications for various applications in AR …

Pathdreamer: A world model for indoor navigation

JY Koh, H Lee, Y Yang, J Baldridge… - Proceedings of the …, 2021 - openaccess.thecvf.com
People navigating in unfamiliar buildings take advantage of myriad visual, spatial and
semantic cues to efficiently achieve their navigation goals. Towards equipping …

Move2Hear: Active audio-visual source separation

S Majumder, Z Al-Halah… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We introduce the active audio-visual source separation problem, where an agent must move
intelligently in order to better isolate the sounds coming from an object of interest in its …

Context understanding in computer vision: A survey

X Wang, Z Zhu - Computer Vision and Image Understanding, 2023 - Elsevier
Contextual information plays an important role in many computer vision tasks, such as object
detection, video action detection, image classification, etc. Recognizing a single object or …

A²Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

P Chen, X Sun, H Zhi, R Zeng, TH Li, G Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet
challenging problem in which an agent learns to navigate following a path described by …

Listening human behavior: 3D human pose estimation with acoustic signals

Y Shibata, Y Kawashima, M Isogawa… - Proceedings of the …, 2023 - openaccess.thecvf.com
Given only acoustic signals without any high-level information, such as voices or sounds of
scenes/actions, how much can we infer about the behavior of humans? Unlike existing …

Disentangled counterfactual learning for physical audiovisual commonsense reasoning

C Lv, S Zhang, Y Tian, M Qi… - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper, we propose a Disentangled Counterfactual Learning (DCL) approach for
physical audiovisual commonsense reasoning. The task aims to infer objects' physics …