Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

Y Luo, H Lin, X Zheng, Y Jiang, F Chao, J Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in
various 3D applications, which require both shared and complementary information in …

Toward explainable and fine-grained 3D grounding through referring textual phrases

Z Yuan, X Yan, Z Li, X Li, Y Guo, S Cui, Z Li - arXiv preprint arXiv …, 2022 - arxiv.org
Recent progress in 3D scene understanding has explored visual grounding (3DVG) to
localize a target object through a language description. However, existing methods only …

Grounded 3D-LLM with Referent Tokens

Y Chen, S Yang, H Huang, T Wang, R Lyu, R Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Prior studies on 3D scene understanding have primarily developed specialized models for
specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D …

Vision meets mmWave Radar: 3D Object Perception Benchmark for Autonomous Driving

Y Wang, JH Cheng, JT Huang, SY Kuan, Q Fu… - arXiv preprint arXiv …, 2023 - arxiv.org
Sensor fusion is crucial for an accurate and robust perception system on autonomous
vehicles. Most existing datasets and perception solutions focus on fusing cameras and …

ScanEnts3D: Exploiting phrase-to-3D-object correspondences for improved visio-linguistic models in 3D scenes

A Abdelreheem, K Olszewski, HY Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com
The two popular datasets ScanRefer [20] and ReferIt3D [5] connect natural language to real-
world 3D scenes. In this paper, we curate a complementary dataset extending both the …

LERF: Language embedded radiance fields

J Kerr, CM Kim, K Goldberg… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans describe the physical world using natural language to refer to specific 3D locations
based on a vast range of properties: visual appearance, semantics, abstract associations, or …

WaterVG: Waterway visual grounding based on text-guided vision and mmWave radar

R Guan, L Jia, F Yang, S Yao, E Purwanto… - arXiv preprint arXiv …, 2024 - arxiv.org
The perception of waterways based on human intent holds significant importance for
autonomous navigation and operations of Unmanned Surface Vehicles (USVs) in water …

X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks

Z Qian, Y Ma, J Ji, X Sun - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Referring 3D instance segmentation is a challenging task aimed at accurately segmenting a
target instance within a 3D scene based on a given referring expression. However, previous …

Context-aware alignment and mutual masking for 3D-language pre-training

Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Z Qi, R Dong, S Zhang, H Geng, C Han, Z Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM)
designed for embodied interaction, exploring a universal 3D object understanding with 3D …