AffordanceLLM: Grounding Affordance from Vision Language Models

S Qian, W Chen, M Bai, X Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
Affordance grounding refers to the task of finding the area of an object with which one can
interact. It is a fundamental but challenging task as a successful solution requires the …

Grounded 3D-LLM with Referent Tokens

Y Chen, S Yang, H Huang, T Wang, R Lyu, R Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Prior studies on 3D scene understanding have primarily developed specialized models for
specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D …

Empowering 3D Visual Grounding with Reasoning Capabilities

C Zhu, T Wang, W Zhang, K Chen, X Liu - arXiv preprint arXiv:2407.01525, 2024 - arxiv.org
Although great progress has been made in 3D visual grounding, current models still rely on
explicit textual descriptions for grounding and lack the ability to reason about human intentions …

When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

X Ma, Y Bhalgat, B Smart, S Chen, X Li, J Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs)
has seen rapid progress, offering unprecedented capabilities for understanding and …

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

R Lyu, T Wang, J Lin, S Yang, X Mao, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
With the emergence of LLMs and their integration with other data modalities, multi-modal 3D
perception has attracted increasing attention owing to its connection to the physical world and makes …

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

J Yang, X Chen, N Madaan, M Iyengar, S Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
The integration of language and 3D perception is crucial for developing embodied agents
and robots that comprehend and interact with the physical world. While large language …

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Z Wang, Z Zhang, H Zhang, L Liu, R Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, human-computer interaction with various modalities has shown promising
applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint …

Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph

S Linok, T Zemskova, S Ladanova, R Titkov… - arXiv preprint arXiv …, 2024 - arxiv.org
Locating objects referred to in natural language poses a significant challenge for
autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform …

Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

W Kang, M Qu, J Kini, Y Wei, M Shah, Y Yan - arXiv preprint arXiv …, 2024 - arxiv.org
In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or
intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object …

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Z Wang, Z Zhang, X Cheng, R Huang, L Liu… - Forty-first International … - openreview.net
Unified multi-modal representation spaces are the foundation of multimodal understanding
and generation. However, the billions of model parameters and catastrophic forgetting …