SceneVerse: Scaling 3D vision-language learning for grounded scene understanding

B Jia, Y Chen, H Yu, Y Wang, X Niu, T Liu, Q Li… - … on Computer Vision, 2025 - Springer
3D vision-language (3D-VL) grounding, which aims to align language with 3D
physical environments, stands as a cornerstone in developing embodied agents. In …

HUMANISE: Language-conditioned human motion generation in 3D scenes

Z Wang, Y Chen, T Liu, Y Zhu… - Advances in Neural …, 2022 - proceedings.neurips.cc
Learning to generate diverse scene-aware and goal-oriented human motions in 3D scenes
remains challenging due to the mediocre characteristics of the existing datasets on Human …

GSVA: Generalized segmentation via multimodal large language models

Z Xia, D Han, Y Han, X Pan, S Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generalized Referring Expression Segmentation (GRES) extends the scope of
classic RES to refer to multiple objects in one expression or identify the empty targets absent …

MEWL: Few-shot multimodal word learning with referential uncertainty

G Jiang, M Xu, S Xin, W Liang, Y Peng… - International …, 2023 - proceedings.mlr.press
Without explicit feedback, humans can rapidly learn the meaning of words. Children can
acquire a new word after just a few passive exposures, a process known as fast mapping …

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Z Wang, Y Chen, B Jia, P Li, J Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite significant advancements in text-to-motion synthesis, generating language-guided
human motion within 3D environments poses substantial challenges. These challenges …

EQA-MX: Embodied question answering using multimodal expression

MM Islam, A Gladstone, R Islam… - The Twelfth International …, 2023 - openreview.net
Humans predominantly use verbal utterances and nonverbal gestures (e.g., eye gaze and
pointing gestures) in their natural interactions. For instance, pointing gestures and verbal …

HandMeThat: Human-robot communication in physical and social environments

Y Wan, J Mao, J Tenenbaum - Advances in Neural …, 2022 - proceedings.neurips.cc
We introduce HandMeThat, a benchmark for a holistic evaluation of instruction
understanding and following in physical and social environments. While previous datasets …

PATRON: Perspective-aware multitask model for referring expression grounding using embodied multimodal cues

MM Islam, A Gladstone, T Iqbal - … of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Humans naturally use referring expressions with verbal utterances and nonverbal gestures
to refer to objects and events. As these referring expressions can be interpreted differently …

CAESAR: An embodied simulator for generating multimodal referring expression datasets

MM Islam, R Mirzaiee, A Gladstone… - Advances in Neural …, 2022 - proceedings.neurips.cc
Humans naturally use verbal utterances and nonverbal gestures to refer to various objects
(known as "referring expressions") in different interactional scenarios. As collecting …

GVGNet: Gaze-directed visual grounding for learning under-specified object referring intention

K Qian, Z Zhang, W Song, J Liao - IEEE Robotics and …, 2023 - ieeexplore.ieee.org
Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES)
enable robots to infer humans' object-referring intentions through natural language. In this …