EDA: Explicit text-decoupling and dense alignment for 3D visual grounding

Y Wu, X Cheng, R Zhang, Z Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D visual grounding aims to find the object within point clouds mentioned by free-form
natural language descriptions with rich semantic cues. However, existing methods …

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning

S Chen, X Chen, C Zhang, M Li, G Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent progress in Large Multimodal Models (LMM) has opened up great
possibilities for various applications in the field of human-machine interactions. However …

CLIP-guided vision-language pre-training for question answering in 3D scenes

M Parelli, A Delitzas, N Hars… - Proceedings of the …, 2023 - openaccess.thecvf.com
Training models to apply linguistic knowledge and visual concepts from 2D images to 3D
world understanding is a promising direction that researchers have only recently started to …

End-to-end 3D dense captioning with Vote2Cap-DETR

S Chen, H Zhu, X Chen, Y Lei… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D dense captioning aims to generate multiple captions localized with their
associated object regions. Existing methods follow a sophisticated "detect-then-describe" …

A comprehensive survey of 3D dense captioning: Localizing and describing objects in 3D scenes

T Yu, X Lin, S Wang, W Sheng… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Three-Dimensional (3D) dense captioning is an emerging vision-language bridging task that
aims to generate multiple detailed and accurate descriptions for 3D scenes. It presents …

Recent advances in multi-modal 3D scene understanding: A comprehensive survey and evaluation

Y Lei, Z Wang, F Chen, G Wang, P Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Multi-modal 3D scene understanding has gained considerable attention due to its wide
applications in many areas, such as autonomous driving and human-computer interaction …

ScanEnts3D: Exploiting phrase-to-3D-object correspondences for improved visio-linguistic models in 3D scenes

A Abdelreheem, K Olszewski, HY Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com
The two popular datasets ScanRefer [20] and ReferIt3D [5] connect natural language to real-world
3D scenes. In this paper, we curate a complementary dataset extending both the …

Dense object grounding in 3D scenes

W Huang, D Liu, W Hu - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
Localizing objects in 3D scenes according to the semantics of a given natural language is a
fundamental yet important task in the field of multimedia understanding, which benefits …

Vote2Cap-DETR++: Decoupling localization and describing for end-to-end 3D dense captioning

S Chen, H Zhu, M Li, X Chen, P Guo… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
3D dense captioning requires a model to translate its understanding of an input 3D scene
into several captions associated with different object regions. Existing methods adopt a …

Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

T Zhang, S He, T Dai, Z Wang, B Chen… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In recent years, vision language pre-training frameworks have made significant progress in
natural language processing and computer vision, achieving remarkable performance …