CityLLaVA: Efficient fine-tuning for VLMs in city scenario

Z Duan, H Cheng, D Xu, X Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
In the vast and dynamic landscape of urban settings, Traffic Safety Description and Analysis
plays a pivotal role in applications ranging from insurance inspection to accident prevention …

Multi-perspective traffic video description model with fine-grained refinement approach

TA To, MN Tran, TB Ho, TL Ha… - Proceedings of the …, 2024 - openaccess.thecvf.com
The analysis of traffic patterns is crucial for enhancing safety and optimizing flow within
urban cities. While urban cities possess extensive camera networks for monitoring, the raw …

Divide and conquer boosting for enhanced traffic safety description and analysis with large vision language model

KT Xuan, KN Nguyen, BH Ngo… - Proceedings of the …, 2024 - openaccess.thecvf.com
The increasing complexity of traffic dynamics has underscored the necessity for advanced
traffic safety description and analysis, challenging the efficacy of current methodologies in …

VILA: On pre-training for visual language models

J Lin, H Yin, W Ping, P Molchanov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual language models (VLMs) rapidly progressed with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …

TrafficVLM: A controllable visual language model for traffic video captioning

QM Dinh, MK Ho, AQ Dang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Traffic video description and analysis have received much attention recently due to the
growing demand for efficient and reliable urban surveillance systems. Most existing methods …

RegionGPT: Towards region understanding vision language model

Q Guo, S De Mello, H Yin, W Byeon… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision language models (VLMs) have experienced rapid advancements through the
integration of large language models (LLMs) with image-text pairs, yet they struggle with …

LAVENDER: Unifying video-language understanding as masked language modeling

L Li, Z Gan, K Lin, CC Lin, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Unified vision-language frameworks have greatly advanced in recent years, most of which
adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence …

Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

A Gopalkrishnan, R Greer, M Trivedi - arXiv preprint arXiv:2403.19838, 2024 - arxiv.org
Vision-Language Models (VLMs) and Multi-Modal Language models (MMLMs) have
become prominent in autonomous driving research, as these models can provide …

Probing conceptual understanding of large visual-language models

M Schiappa, R Abdullah, S Azad… - Proceedings of the …, 2024 - openaccess.thecvf.com
In recent years, large visual-language (V+L) models have achieved great success in various
downstream tasks. However, it is not well studied whether these models have a conceptual …

ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

K Zhou, K Lee, T Misu, XE Wang - arXiv preprint arXiv:2310.05872, 2023 - arxiv.org
In our work, we explore the synergistic capabilities of pre-trained vision-and-language
models (VLMs) and large language models (LLMs) for visual commonsense reasoning …