Cityllava: Efficient fine-tuning for vlms in city scenario

4 results (0.02 seconds)

[PDF] thecvf.com

The 8th AI City Challenge

S Wang, DC Anastasiu, Z Tang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Abstract The eighth AI City Challenge highlighted the convergence of computer vision and
artificial intelligence in areas like retail warehouse settings and Intelligent Traffic Systems …

Cited by 33 · Related articles · All 5 versions

[PDF] arxiv.org

Visual prompting in multimodal large language models: A survey

J Wu, Z Zhang, Y Xia, X Li, Z Xia, A Chang, T Yu… - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal large language models (MLLMs) equip pre-trained large-language models
(LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied …

Cited by 5 · Related articles · All 3 versions

[PDF] arxiv.org

Number it: Temporal Grounding Videos like Flipping Manga

Y Wu, X Hu, Y Sun, Y Zhou, W Zhu, F Rao… - arXiv preprint arXiv …, 2024 - arxiv.org

Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend this visual …

Cited by 2 · Related articles · All 3 versions

[PDF] arxiv.org

Examining the commitments and difficulties inherent in multimodal foundation models for street view imagery

Z Yang, X Lin, Q He, Z Huang, Z Liu, H Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org

The emergence of Large Language Models (LLMs) and multimodal foundation models
(FMs) has generated heightened interest in their applications that integrate vision and …

Cited by 1 · Related articles · All 2 versions
