Spatiality-guided transformer for 3d dense captioning on point clouds

Y Wu, X Cheng, R Zhang, Z Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com

Abstract 3D visual grounding aims to find the object within point clouds mentioned by free-
form natural language descriptions with rich semantic cues. However, existing methods …

Cited by 47 - Related articles - All 5 versions

[PDF] thecvf.com

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning

S Chen, X Chen, C Zhang, M Li, G Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Abstract Recent progress in Large Multimodal Models (LMM) has opened up great
possibilities for various applications in the field of human-machine interactions. However …

Cited by 14 - Related articles - All 3 versions

[PDF] thecvf.com

Clip-guided vision-language pre-training for question answering in 3d scenes

M Parelli, A Delitzas, N Hars… - Proceedings of the …, 2023 - openaccess.thecvf.com

Training models to apply linguistic knowledge and visual concepts from 2D images to 3D
world understanding is a promising direction that researchers have only recently started to …

Cited by 22 - Related articles - All 7 versions

[PDF] thecvf.com

End-to-end 3d dense captioning with vote2cap-detr

S Chen, H Zhu, X Chen, Y Lei… - Proceedings of the …, 2023 - openaccess.thecvf.com

Abstract 3D dense captioning aims to generate multiple captions localized with their
associated object regions. Existing methods follow a sophisticated "detect-then-describe" …

Cited by 25 - Related articles - All 7 versions

[PDF] arxiv.org

A comprehensive survey of 3d dense captioning: Localizing and describing objects in 3d scenes

T Yu, X Lin, S Wang, W Sheng… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Three-Dimensional (3D) dense captioning is an emerging vision-language bridging task that
aims to generate multiple detailed and accurate descriptions for 3D scenes. It presents …

Cited by 3 - Related articles - All 4 versions

[PDF] arxiv.org

Recent advances in multi-modal 3D scene understanding: A comprehensive survey and evaluation

Y Lei, Z Wang, F Chen, G Wang, P Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

Multi-modal 3D scene understanding has gained considerable attention due to its wide
applications in many areas, such as autonomous driving and human-computer interaction …

Cited by 3 - Related articles - All 2 versions

[PDF] thecvf.com

Scanents3d: Exploiting phrase-to-3d-object correspondences for improved visio-linguistic models in 3d scenes

A Abdelreheem, K Olszewski, HY Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com

The two popular datasets ScanRefer [20] and ReferIt3D [5] connect natural language to real-
world 3D scenes. In this paper, we curate a complementary dataset extending both the …

Cited by 17 - Related articles - All 7 versions

[PDF] arxiv.org

Dense object grounding in 3d scenes

W Huang, D Liu, W Hu - Proceedings of the 31st ACM International …, 2023 - dl.acm.org

Localizing objects in 3D scenes according to the semantics of a given natural language is a
fundamental yet important task in the field of multimedia understanding, which benefits …

Cited by 5 - Related articles - All 3 versions

[PDF] arxiv.org

Vote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning

S Chen, H Zhu, M Li, X Chen, P Guo… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org

3D dense captioning requires a model to translate its understanding of an input 3D scene
into several captions associated with different object regions. Existing methods adopt a …

Cited by 3 - Related articles - All 6 versions

[PDF] aaai.org

Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

T Zhang, S He, T Dai, Z Wang, B Chen… - Proceedings of the AAAI …, 2024 - ojs.aaai.org

In recent years, vision language pre-training frameworks have made significant progress in
natural language processing and computer vision, achieving remarkable performance …

Cited by 6 - Related articles - All 4 versions
