MixCon3D: Synergizing Multi-View and Cross-Modal Contrastive Learning for Enhancing 3D Representation

Y Gao, Z Wang, WS Zheng, C Xie, Y Zhou - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive learning has emerged as a promising paradigm for 3D open-world
understanding, jointly leveraging text, image, and point cloud modalities. In this paper, we introduce …

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding Supplementary Material

DZ Chen, R Hu, X Chen, M Nießner, AX Chang - openaccess.thecvf.com
In this supplementary material, we provide detailed dense captioning results on the
ScanRefer dataset in Sec. 1. To showcase the effectiveness of the proposed pre-training …

Semi-automatic annotation of 3D Radar and Camera for Smart Infrastructure-based perception

S Agrawal, S Bhanderi, G Elger - IEEE Access, 2024 - ieeexplore.ieee.org
Environment perception using camera, radar, and/or lidar sensors has significantly improved
in the last few years because of deep learning-based methods. However, a large group of …

D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

DZ Chen, Q Wu, M Nießner, AX Chang - European Conference on …, 2022 - Springer
Recent work on dense captioning and visual grounding in 3D has achieved impressive
results. Despite developments in both areas, the limited amount of available 3D vision …

Language prompt for autonomous driving

D Wu, W Han, T Wang, Y Liu, X Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
A new trend in the computer vision community is to capture objects of interest following
flexible human command represented by a natural language prompt. However, the progress …

Ferret: Refer and ground anything anywhere at any granularity

H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

MEA Boudjoghra, A Dai, J Lahoud, H Cholakkal… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent works on open-vocabulary 3D instance segmentation show strong promise, but at
the cost of slow inference speed and high computation requirements. This high computation …

Kosmos-2: Grounding multimodal large language models to the world

Z Peng, W Wang, L Dong, Y Hao, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …

ML-PersRef: A machine learning-based personalized multimodal fusion approach for referencing outside objects from a moving vehicle

A Gomaa, G Reyes, M Feld - … of the 2021 International Conference on …, 2021 - dl.acm.org
Over the past decades, the addition of hundreds of sensors to modern vehicles has led to an
exponential increase in their capabilities. This allows for novel approaches to interaction …

Grounding linguistic commands to navigable regions

N Rufus, K Jain, UKR Nair, V Gandhi… - 2021 IEEE/RSJ …, 2021 - ieeexplore.ieee.org
Humans have a natural ability to effortlessly comprehend linguistic commands such as “park
next to the yellow sedan” and instinctively know which region of the road the vehicle should …