Improving multi-party dialogue discourse parsing via domain integration

Z Liu, NF Chen - arXiv preprint arXiv:2110.04526, 2021 - arxiv.org
While multi-party conversations are often less structured than monologues and documents,
they are implicitly organized by semantic-level correlations across the interactive turns, and …

Contextual modeling for 3D dense captioning on point clouds

Y Zhong, L Xu, J Luo, L Ma - arXiv preprint arXiv:2210.03925, 2022 - arxiv.org
3D dense captioning, as an emerging vision-language task, aims to identify and locate each
object from a set of point clouds and generate a distinctive natural language sentence for …

LLM-Grounder: Open-vocabulary 3D visual grounding with large language model as an agent

J Yang, X Chen, S Qian, N Madaan, M Iyengar… - arXiv preprint arXiv …, 2023 - arxiv.org
3D visual grounding is a critical skill for household robots, enabling them to navigate,
manipulate objects, and answer questions based on their environment. While existing …

HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

X Ding, J Han, H Xu, W Zhang, X Li - arXiv preprint arXiv:2309.05186, 2023 - arxiv.org
Autonomous driving systems generally employ separate models for different tasks, resulting
in intricate designs. For the first time, we leverage singular multimodal large language …

Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

R Korekata, M Kambara, Y Yoshida… - 2023 IEEE/RSJ …, 2023 - ieeexplore.ieee.org
This paper describes a domestic service robot (DSR) that fetches everyday objects and
carries them to specified destinations according to free-form natural language instructions …

Channel-Aware Decoupling Network for Multiturn Dialog Comprehension

Z Zhang, H Zhao, L Liu - IEEE Transactions on Neural …, 2022 - ieeexplore.ieee.org
Training machines to understand natural language and interact with humans is one of the
major goals of artificial intelligence. Recent years have witnessed an evolution from …

SparseFusion3D: Sparse sensor fusion for 3D object detection by radar and camera in environmental perception

Z Yu, W Wan, M Ren, X Zheng… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In the context of autonomous driving environment perception, multi-modal fusion plays a
pivotal role in enhancing robustness, completeness, and accuracy, thereby extending the …

Echoes beyond points: Unleashing the power of raw radar data in multi-modality fusion

Y Liu, F Wang, N Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Radar is ubiquitous in autonomous driving systems due to its low cost and good adaptability
to bad weather. Nevertheless, the radar detection performance is usually inferior because its …

Can 3D Vision-Language Models Truly Understand Natural Language?

W Deng, R Ding, J Yang, J Liu, Y Li, X Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues
for human interaction with embodied agents or robots using natural language. Despite this …

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

D Zhenyu Chen, A Gholami, M Nießner… - arXiv e …, 2020 - ui.adsabs.harvard.edu
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As
input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes …