SceneVerse: Scaling 3D vision-language learning for grounded scene understanding

B Jia, Y Chen, H Yu, Y Wang, X Niu, T Liu, Q Li… - … on Computer Vision, 2025 - Springer
3D vision-language (3D-VL) grounding, which aims to align language with 3D
physical environments, stands as a cornerstone in developing embodied agents. In …

HUMANISE: Language-conditioned human motion generation in 3D scenes

Z Wang, Y Chen, T Liu, Y Zhu… - Advances in Neural …, 2022 - proceedings.neurips.cc
Learning to generate diverse scene-aware and goal-oriented human motions in 3D scenes
remains challenging due to the mediocre characteristics of the existing datasets on Human …

GSVA: Generalized segmentation via multimodal large language models

Z Xia, D Han, Y Han, X Pan, S Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generalized Referring Expression Segmentation (GRES) extends the scope of
classic RES to refer to multiple objects in one expression or identify the empty targets absent …

MEWL: Few-shot multimodal word learning with referential uncertainty

G Jiang, M Xu, S Xin, W Liang, Y Peng… - International …, 2023 - proceedings.mlr.press
Without explicit feedback, humans can rapidly learn the meaning of words. Children can
acquire a new word after just a few passive exposures, a process known as fast mapping …

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Z Wang, Y Chen, B Jia, P Li, J Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite significant advancements in text-to-motion synthesis, generating language-guided
human motion within 3D environments poses substantial challenges. These challenges …

EQA-MX: Embodied question answering using multimodal expression

MM Islam, A Gladstone, R Islam… - The Twelfth International …, 2023 - openreview.net
Humans predominantly use verbal utterances and nonverbal gestures (e.g., eye gaze and
pointing gestures) in their natural interactions. For instance, pointing gestures and verbal …

HandMeThat: Human-robot communication in physical and social environments

Y Wan, J Mao, J Tenenbaum - Advances in Neural …, 2022 - proceedings.neurips.cc
We introduce HandMeThat, a benchmark for a holistic evaluation of instruction
understanding and following in physical and social environments. While previous datasets …

PATRON: Perspective-aware multitask model for referring expression grounding using embodied multimodal cues

MM Islam, A Gladstone, T Iqbal - … of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Humans naturally use referring expressions with verbal utterances and nonverbal gestures
to refer to objects and events. As these referring expressions can be interpreted differently …

CAESAR: An embodied simulator for generating multimodal referring expression datasets

MM Islam, R Mirzaiee, A Gladstone… - Advances in Neural …, 2022 - proceedings.neurips.cc
Humans naturally use verbal utterances and nonverbal gestures to refer to various objects
(known as "referring expressions") in different interactional scenarios. As collecting …

GVGNet: Gaze-directed visual grounding for learning under-specified object referring intention

K Qian, Z Zhang, W Song, J Liao - IEEE Robotics and …, 2023 - ieeexplore.ieee.org
Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES)
enable robots to infer humans' object-referring intentions through natural language. In this …