The 8th AI City Challenge

S Wang, DC Anastasiu, Z Tang… - Proceedings of the …, 2024 - openaccess.thecvf.com
The eighth AI City Challenge highlighted the convergence of computer vision and
artificial intelligence in areas like retail warehouse settings and Intelligent Traffic Systems …

Visual prompting in multimodal large language models: A survey

J Wu, Z Zhang, Y Xia, X Li, Z Xia, A Chang, T Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) equip pre-trained large language models
(LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied …

Number it: Temporal Grounding Videos like Flipping Manga

Y Wu, X Hu, Y Sun, Y Zhou, W Zhu, F Rao… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend this visual …

Examining the commitments and difficulties inherent in multimodal foundation models for street view imagery

Z Yang, X Lin, Q He, Z Huang, Z Liu, H Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence of Large Language Models (LLMs) and multimodal foundation models
(FMs) has generated heightened interest in their applications that integrate vision and …