AffordanceLLM: Grounding Affordance from Vision Language Models

S Qian, W Chen, M Bai, X Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
Affordance grounding refers to the task of finding the area of an object with which one can
interact. It is a fundamental but challenging task as a successful solution requires the …

Grounded 3D-LLM with Referent Tokens

Y Chen, S Yang, H Huang, T Wang, R Lyu, R Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Prior studies on 3D scene understanding have primarily developed specialized models for
specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D …

Empowering 3D Visual Grounding with Reasoning Capabilities

C Zhu, T Wang, W Zhang, K Chen, X Liu - arXiv preprint arXiv:2407.01525, 2024 - arxiv.org
Although great progress has been made in 3D visual grounding, current models still rely on
explicit textual descriptions for grounding and lack the ability to reason about human intentions …

When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

X Ma, Y Bhalgat, B Smart, S Chen, X Li, J Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs)
has seen rapid progress, offering unprecedented capabilities for understanding and …

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

R Lyu, T Wang, J Lin, S Yang, X Mao, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
With the emergence of LLMs and their integration with other data modalities, multi-modal 3D
perception has attracted increasing attention owing to its connection to the physical world and makes …

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

J Yang, X Chen, N Madaan, M Iyengar, S Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
The integration of language and 3D perception is crucial for developing embodied agents
and robots that comprehend and interact with the physical world. While large language …

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Z Wang, Z Zhang, H Zhang, L Liu, R Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, human-computer interaction with various modalities has shown promising
applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint …

Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph

S Linok, T Zemskova, S Ladanova, R Titkov… - arXiv preprint arXiv …, 2024 - arxiv.org
Locating objects referred to in natural language poses a significant challenge for
autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform …

Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

W Kang, M Qu, J Kini, Y Wei, M Shah, Y Yan - arXiv preprint arXiv …, 2024 - arxiv.org
In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or
intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object …

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Z Wang, Z Zhang, X Cheng, R Huang, L Liu… - Forty-first International … - openreview.net
Unified multi-modal representation spaces are the foundation of multimodal understanding
and generation. However, the billions of model parameters and catastrophic forgetting …