Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

S Luo, W Chen, W Tian, R Liu, L Hou, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models have indeed made a profound impact on various fields, emerging as
pivotal components that significantly shape the capabilities of intelligent systems. In the …

Embodied understanding of driving scenarios

Y Zhou, L Huang, Q Bu, J Zeng, T Li, H Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied scene understanding serves as the cornerstone for autonomous agents to
perceive, interpret, and respond to open driving scenarios. Such understanding is typically …

Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

X Ding, J Han, H Xu, X Liang… - Proceedings of the …, 2024 - openaccess.thecvf.com
The rise of multimodal large language models (MLLMs) has spurred interest in language-
based driving tasks. However, existing research typically focuses on limited tasks and often …

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

L Wen, X Yang, D Fu, X Wang, P Cai, X Li, T Ma… - arXiv preprint arXiv …, 2023 - arxiv.org
The pursuit of autonomous driving technology hinges on the sophisticated integration of
perception, decision-making, and control systems. Traditional approaches, both data-driven …

Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models

TH Wang, A Maalouf, W Xiao, Y Ban, A Amini… - arXiv preprint arXiv …, 2023 - arxiv.org
As autonomous driving technology matures, end-to-end methodologies have emerged as a
leading strategy, promising seamless integration from perception to control via deep …

Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction

HS Choi, J Jeong, YH Cho, KJ Yoon, JH Kim - arXiv preprint arXiv …, 2023 - arxiv.org
Sensor fusion approaches for intelligent self-driving agents remain key to driving scene
understanding given visual global contexts acquired from input sensors. Specifically, for the …

SRSU: An Online Road Map Detection and Network Estimation for Structured Bird's-Eye View Road Scene Understanding

P Jia, Y Jiang, Z Ju, J Qi, Z Zang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Autonomous driving requires a structured understanding of the surrounding road maps and
networks to navigate. However, considering the flexibility of autonomous vehicles and the …

CarDreamer: Open-Source Learning Platform for World Model based Autonomous Driving

D Gao, S Cai, H Zhou, H Wang, I Soltani… - arXiv preprint arXiv …, 2024 - arxiv.org
To safely navigate intricate real-world scenarios, autonomous vehicles must be able to
adapt to diverse road conditions and anticipate future events. World model (WM) based …

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

X Cao, T Zhou, Y Ma, W Ye, C Cui… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language generative AI has demonstrated remarkable promise for empowering cross-
modal scene understanding of autonomous driving and high-definition (HD) map systems …

SELMA: Semantic Large-Scale Multimodal Acquisitions in Variable Weather, Daytime and Viewpoints

P Testolina, F Barbato, U Michieli… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Accurate scene understanding from multiple sensors mounted on cars is a key requirement
for autonomous driving systems. Nowadays, this task is mainly performed through data …