An embodied generalist agent in 3D world

J Huang, S Yong, X Ma, X Linghu, P Li, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Leveraging massive knowledge and learning schemes from large language models (LLMs),
recent machine learning models show notable successes in building generalist agents that …

MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors

Y Tang, X Han, X Li, Q Yu, Y Hao, L Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging
Large Language Models (LLMs) with images using a simple projector. Inspired by their …

Empowering 3D Visual Grounding with Reasoning Capabilities

C Zhu, T Wang, W Zhang, K Chen, X Liu - arXiv preprint arXiv:2407.01525, 2024 - arxiv.org
Although great progress has been made in 3D visual grounding, current models still rely on
explicit textual descriptions for grounding and lack the ability to reason human intentions …

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

B Jin, Y Zheng, P Li, W Li, Y Zheng, S Hu, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
3D dense captioning stands as a cornerstone in achieving a comprehensive understanding
of 3D scenes through natural language. It has recently witnessed remarkable achievements …

A survey of label-efficient deep learning for 3D point clouds

A Xiao, X Zhang, L Shao, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
In the past decade, deep neural networks have achieved significant progress in point cloud
learning. However, collecting large-scale precisely-annotated point clouds is extremely …

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

Grounded 3D-LLM with Referent Tokens

Y Chen, S Yang, H Huang, T Wang, R Lyu, R Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Prior studies on 3D scene understanding have primarily developed specialized models for
specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D …

MeshXL: Neural Coordinate Field for Generative 3D Foundation Models

S Chen, X Chen, A Pang, X Zeng, W Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed,
and storage efficiency, which is widely preferred in various applications. However, given its …

View selection for 3D captioning via diffusion ranking

T Luo, J Johnson, H Lee - arXiv preprint arXiv:2404.07984, 2024 - arxiv.org
Scalable annotation approaches are crucial for constructing extensive 3D-text datasets,
facilitating a broader range of applications. However, existing methods sometimes lead to …

When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

X Ma, Y Bhalgat, B Smart, S Chen, X Li, J Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs)
has seen rapid progress, offering unprecedented capabilities for understanding and …