SQA3D: Situated Question Answering in 3D Scenes

X Ma, S Yong, Z Zheng, Q Li, Y Liang, SC Zhu… - arXiv preprint arXiv …, 2022 - arxiv.org
We propose a new task to benchmark scene understanding of embodied agents: Situated
Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D …

Complete 3D Relationships Extraction Modality Alignment Network for 3D Dense Captioning

A Mao, Z Yang, W Chen, R Yi… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
3D dense captioning aims to semantically describe each object detected in a 3D scene, a task
that plays a significant role in 3D scene understanding. Previous works lack a complete …

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

B Jin, Y Zheng, P Li, W Li, Y Zheng, S Hu, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
3D dense captioning stands as a cornerstone in achieving a comprehensive understanding
of 3D scenes through natural language. It has recently witnessed remarkable achievements …

GRC-Net: Fusing GAT-Based 4D Radar and Camera for 3D Object Detection

L Fan, C Zeng, Y Li, X Wang, D Cao - 2023 - sae.org
The fusion of multi-modal perception in autonomous driving plays a pivotal role in vehicle
behavior decision-making. However, much of the previous research has predominantly …

M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts

M Li, X Chen, C Zhang, S Chen, H Zhu, F Yin… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, 3D understanding has gained popularity as a means of facilitating autonomous
agents' decision-making. However, existing 3D datasets and methods are often limited to …

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-Modality for 3D Understanding, Generation, and Instruction Following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D images,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation

G Luo, Y Zhou, X Sun, L Cao, C Wu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Referring expression comprehension (REC) and segmentation (RES) are two highly related
tasks, both of which aim to identify the referent according to a natural language expression …

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

H Liao, H Shen, Z Li, C Wang, G Li, Y Bie… - … in Transportation Research, 2024 - Elsevier
In the field of autonomous vehicles (AVs), accurately discerning commander intent and
executing linguistic commands within a visual context presents a significant challenge. This …

Unified Scene Representation and Reconstruction for 3D Large Language Models

T Chu, P Zhang, X Dong, Y Zang, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Enabling Large Language Models (LLMs) to interact with 3D environments is challenging.
Existing approaches extract point clouds either from ground truth (GT) geometry or 3D …

Knowing Where to Leverage: Context-Aware Graph Convolutional Network with an Adaptive Fusion Layer for Contextual Spoken Language Understanding

L Qin, W Che, M Ni, Y Li, T Liu - IEEE/ACM Transactions on …, 2021 - ieeexplore.ieee.org
Spoken language understanding (SLU) systems, a key component of task-oriented dialogue
systems, aim to understand users' utterances. In this paper, we focus on improving …