Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data in deep neural network
(DNN) training, and they usually train a DNN for each single visual recognition task …

Vision language models in autonomous driving and intelligent transportation systems

X Zhou, M Liu, BL Zagar, E Yurtsever… - arXiv preprint arXiv …, 2023 - arxiv.org
The applications of Vision-Language Models (VLMs) in the fields of Autonomous Driving
(AD) and Intelligent Transportation Systems (ITS) have attracted widespread attention due to …

Unipt: Universal parallel tuning for transfer learning with efficient parameter and memory

H Diao, B Wan, Y Zhang, X Jia… - Proceedings of the …, 2024 - openaccess.thecvf.com
Parameter-efficient transfer learning (PETL), i.e., fine-tuning a small portion of parameters, is an
effective strategy for adapting pre-trained models to downstream domains. To further reduce …

From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos

Y Chen, J Li, S Shan, M Wang, R Hong - arXiv preprint arXiv:2312.05447, 2023 - arxiv.org
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations,
e.g., insufficient quantity and diversity of pose, occlusion and illumination, as well as the …

Vision language models in autonomous driving: A survey and outlook

X Zhou, M Liu, E Yurtsever, BL Zagar… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
The applications of Vision-Language Models (VLMs) in the field of Autonomous Driving (AD)
have attracted widespread attention due to their outstanding performance and the ability to …

Listen as you wish: Fusion of audio and text for cross-modal event detection in smart cities

H Tang, Y Hu, Y Wang, S Zhang, M Xu, J Zhu… - Information Fusion, 2024 - Elsevier
In the era of smart cities, the advent of the Internet of Things technology has catalyzed the
proliferation of multimodal sensor data, presenting new challenges in cross-modal event …

GPT-4V as Traffic Assistant: An In-depth Look at Vision Language Model on Complex Traffic Events

X Zhou, AC Knoll - arXiv preprint arXiv:2402.02205, 2024 - arxiv.org
The recognition and understanding of traffic incidents, particularly traffic accidents, is a topic
of paramount importance in the realm of intelligent transportation systems and intelligent …

CFMMC-Align: Coarse-Fine Multi-Modal Contrastive Alignment Network for Traffic Event Video Question Answering

K Guo, D Tian, Y Hu, C Lin, Z Qian… - … on Circuits and …, 2024 - ieeexplore.ieee.org
Traffic video question answering (TrafficVQA) constitutes a specialized VideoQA task
designed to enhance the basic comprehension and intricate reasoning capacities of videos …

Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer

SX Zhang, H Wang, X Zhu, W Gu, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Video-language alignment is a crucial multi-modal task that benefits various downstream
applications, e.g., video-text retrieval and video question answering. Existing methods either …

Foundation Models for Video Understanding: A Survey

N Madan, A Møgelmose, R Modi, YS Rawat… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various
video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …