SQA3D: Situated Question Answering in 3D Scenes

X Ma, S Yong, Z Zheng, Q Li, Y Liang, SC Zhu… - arXiv preprint arXiv …, 2022 - arxiv.org
We propose a new task to benchmark scene understanding of embodied agents: Situated
Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D …

Complete 3D Relationships Extraction Modality Alignment Network for 3D Dense Captioning

A Mao, Z Yang, W Chen, R Yi… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
3D dense captioning aims to semantically describe each object detected in a 3D scene, a task
that plays a significant role in 3D scene understanding. Previous works lack a complete …

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

B Jin, Y Zheng, P Li, W Li, Y Zheng, S Hu, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
3D dense captioning stands as a cornerstone in achieving a comprehensive understanding
of 3D scenes through natural language. It has recently witnessed remarkable achievements …

GRC-Net: Fusing GAT-Based 4D Radar and Camera for 3D Object Detection

L Fan, C Zeng, Y Li, X Wang, D Cao - 2023 - sae.org
The fusion of multi-modal perception in autonomous driving plays a pivotal role in vehicle
behavior decision-making. However, much of the previous research has predominantly …

M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts

M Li, X Chen, C Zhang, S Chen, H Zhu, F Yin… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, 3D understanding has gained popularity as a means of facilitating autonomous
agents' decision-making. However, existing 3D datasets and methods are often limited to …

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-Modality for 3D Understanding, Generation, and Instruction Following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D images,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation

G Luo, Y Zhou, X Sun, L Cao, C Wu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Referring expression comprehension (REC) and segmentation (RES) are two highly related
tasks, both of which aim to identify the referent according to a natural language expression …

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

H Liao, H Shen, Z Li, C Wang, G Li, Y Bie… - … in Transportation Research, 2024 - Elsevier
In the field of autonomous vehicles (AVs), accurately discerning commander intent and
executing linguistic commands within a visual context presents a significant challenge. This …

Unified Scene Representation and Reconstruction for 3D Large Language Models

T Chu, P Zhang, X Dong, Y Zang, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Enabling Large Language Models (LLMs) to interact with 3D environments is challenging.
Existing approaches extract point clouds either from ground truth (GT) geometry or 3D …

Knowing Where to Leverage: Context-Aware Graph Convolutional Network with an Adaptive Fusion Layer for Contextual Spoken Language Understanding

L Qin, W Che, M Ni, Y Li, T Liu - IEEE/ACM Transactions on …, 2021 - ieeexplore.ieee.org
Spoken language understanding (SLU) systems, a key component of task-oriented dialogue
systems, aim to understand users' utterances. In this paper, we focus on improving …