L Xue, N Yu, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes their 2D …
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video. Guided by ImageBind, we construct a joint embedding space …
Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon that the generated text is …
F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and computational systems. However multimodal learning with touch remains challenging due to …
The unprecedented advancements in Large Language Models (LLMs) have created a profound impact on natural language processing but are yet to fully embrace the realm of 3D …
Abstract Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation but their understanding of the 3D world is notably …
JC Shi, M Wang, HB Duan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene …
Abstract 3D Visual Grounding (3DVG) aims at localizing 3D object based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive …
Current approaches for 3D scene graph prediction rely on labeled datasets to train models for a fixed set of known object classes and relationship categories. We present Open3DSG …