Transcrib3D: 3D Referring Expression Resolution through Large Language Models

J Fang, X Tan, S Lin, I Vasiljevic, V Guizilini… - arXiv preprint arXiv …, 2024 - arxiv.org
If robots are to work effectively alongside people, they must be able to interpret natural
language references to objects in their 3D environment. Understanding 3D referring …

Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning

J Fang, X Tan, S Lin, H Mei, M Walter - 2nd Workshop on Language …, 2023 - openreview.net
If robots are to work effectively alongside people, they must be able to interpret natural
language references to objects in their 3D environment. Understanding 3D referring …

Multi3DRefer: Grounding Text Description to Multiple 3D Objects

Y Zhang, ZM Gong, AX Chang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We introduce the task of localizing a flexible number of objects in real-world 3D scenes
using natural language descriptions. Existing 3D visual grounding tasks focus on localizing …

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Z Qi, R Dong, S Zhang, H Geng, C Han, Z Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM)
designed for embodied interaction, exploring a universal 3D object understanding with 3D …

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

R Lyu, T Wang, J Lin, S Yang, X Mao, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
With the emergence of LLMs and their integration with other data modalities, multi-modal
3D perception has attracted more attention due to its connection to the physical world and makes …

Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension

R Guan, R Zhang, N Ouyang, J Liu, KL Man… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied perception is essential for intelligent vehicles and robots, enabling more natural
interaction and task execution. However, these advancements currently embrace vision …

Chat-3D: Data-Efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

Z Wang, H Huang, Y Zhao, Z Zhang, Z Zhao - arXiv preprint arXiv …, 2023 - arxiv.org
3D scene understanding has gained significant attention due to its wide range of
applications. However, existing methods for 3D scene understanding are limited to specific …

LERF: Language Embedded Radiance Fields

J Kerr, CM Kim, K Goldberg… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans describe the physical world using natural language to refer to specific 3D locations
based on a vast range of properties: visual appearance, semantics, abstract associations, or …

Grounded 3D-LLM with Referent Tokens

Y Chen, S Yang, H Huang, T Wang, R Lyu, R Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Prior studies on 3D scene understanding have primarily developed specialized models for
specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D …

Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

Z Yuan, X Yan, Z Li, X Li, Y Guo, S Cui, Z Li - arXiv preprint arXiv …, 2022 - arxiv.org
Recent progress in 3D scene understanding has explored 3D visual grounding (3DVG) to
localize a target object through a language description. However, existing methods only …