MixCon3D: Synergizing Multi-View and Cross-Modal Contrastive Learning for Enhancing 3D Representation

Y Gao, Z Wang, WS Zheng, C Xie, Y Zhou - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive learning has emerged as a promising paradigm for 3D open-world
understanding, jointly leveraging text, image, and point cloud modalities. In this paper, we introduce …

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding Supplementary Material

DZ Chen, R Hu, X Chen, M Nießner, AX Chang - openaccess.thecvf.com
In this supplementary material, we provide detailed dense captioning results on the
ScanRefer dataset in Sec. 1. To showcase the effectiveness of the proposed pre-training …

Semi-automatic annotation of 3D Radar and Camera for Smart Infrastructure-based perception

S Agrawal, S Bhanderi, G Elger - IEEE Access, 2024 - ieeexplore.ieee.org
Environment perception using camera, radar, and/or lidar sensors has significantly improved
in the last few years because of deep learning-based methods. However, a large group of …

D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

DZ Chen, Q Wu, M Nießner, AX Chang - European Conference on …, 2022 - Springer
Recent work on dense captioning and visual grounding in 3D has achieved impressive
results. Despite developments in both areas, the limited amount of available 3D vision …

Language prompt for autonomous driving

D Wu, W Han, T Wang, Y Liu, X Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
A new trend in the computer vision community is to capture objects of interest following
flexible human command represented by a natural language prompt. However, the progress …

Ferret: Refer and ground anything anywhere at any granularity

H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

MEA Boudjoghra, A Dai, J Lahoud, H Cholakkal… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent works on open-vocabulary 3D instance segmentation show strong promise, but at
the cost of slow inference speed and high computation requirements. This high computation …

Kosmos-2: Grounding multimodal large language models to the world

Z Peng, W Wang, L Dong, Y Hao, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …

ML-PersRef: A machine learning-based personalized multimodal fusion approach for referencing outside objects from a moving vehicle

A Gomaa, G Reyes, M Feld - … of the 2021 International Conference on …, 2021 - dl.acm.org
Over the past decades, the addition of hundreds of sensors to modern vehicles has led to an
exponential increase in their capabilities. This allows for novel approaches to interaction …

Grounding linguistic commands to navigable regions

N Rufus, K Jain, UKR Nair, V Gandhi… - 2021 IEEE/RSJ …, 2021 - ieeexplore.ieee.org
Humans have a natural ability to effortlessly comprehend linguistic commands such as “park
next to the yellow sedan” and instinctively know which region of the road the vehicle should …