EDA: Explicit text-decoupling and dense alignment for 3D visual grounding

Y Wu, X Cheng, R Zhang, Z Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues. However, existing methods …

FILM: Following instructions in language with modular methods

SY Min, DS Chaplot, P Ravikumar, Y Bisk… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent methods for embodied instruction following are typically trained end-to-end using
imitation learning. This often requires the use of expert trajectories and low-level language …

Embodied concept learner: Self-supervised learning of concepts and mapping through instruction following

M Ding, Y Xu, Z Chen, DD Cox, P Luo… - … on robot learning, 2023 - proceedings.mlr.press
Humans, even at a very early age, can learn visual concepts and understand geometry and
layout through active interaction with the environment, and generalize their compositions to …

Episodic memory question answering

S Datta, S Dharur, V Cartillier, R Desai… - Proceedings of the …, 2022 - openaccess.thecvf.com
Egocentric augmented reality devices such as wearable glasses passively capture visual
data as a human wearer tours a home environment. We envision a scenario wherein the …

Learning 3D dynamic scene representations for robot manipulation

Z Xu, Z He, J Wu, S Song - arXiv preprint arXiv:2011.01968, 2020 - arxiv.org
3D scene representation for robot manipulation should capture three key object properties:
permanency--objects that become occluded over time continue to exist; amodal …

Visual language navigation: A survey and open challenges

SM Park, YG Kim - Artificial Intelligence Review, 2023 - Springer
With the recent development of deep learning, AI models are widely used in various
domains. AI models show good performance for definite tasks such as image classification …

Fast and explicit neural view synthesis

P Guo, MA Bautista, A Colburn… - Proceedings of the …, 2022 - openaccess.thecvf.com
We study the problem of novel view synthesis from sparse source observations of a scene
comprised of 3D objects. We propose a simple yet effective approach that is neither …

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

O Unal, C Sakaridis, S Saha, L Van Gool - European Conference on …, 2025 - Springer
3D visual grounding is the task of localizing the object in a 3D scene which is referred to by a description in natural language. With a wide range of applications ranging from …

Voxel-informed language grounding

R Corona, S Zhu, D Klein, T Darrell - arXiv preprint arXiv:2205.09710, 2022 - arxiv.org
Natural language applied to natural 2D images describes a fundamentally 3D world. We
present the Voxel-informed Language Grounder (VLG), a language grounding model that …

Multi-Attribute Interactions Matter for 3D Visual Grounding

C Xu, Y Han, R Xu, L Hui, J Xie… - Proceedings of the …, 2024 - openaccess.thecvf.com
3D visual grounding aims to localize 3D objects described by free-form language sentences. Following the detection-then-matching paradigm, existing methods mainly focus …