Learning to localize objects improves spatial reasoning in visual-LLMs

X Li, C Mata, J Park, K Kahatapitiya, YS Jang… - arXiv preprint arXiv …, 2024 - arxiv.org

LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …

ControlMLLM: Training-free visual prompt learning for multimodal large language models

M Wu, X Cai, J Ji, J Li, O Huang, G Luo, H Fei… - arXiv preprint arXiv …, 2024 - arxiv.org

In this work, we propose a training-free method to inject visual referring into Multimodal
Large Language Models (MLLMs) through learnable visual token optimization. We observe …

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

R Liao, M Erler, H Wang, G Zhai, G Zhang, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org

In the video-language domain, recent works in leveraging zero-shot Large Language Model-
based reasoning for video understanding have become competitive challengers to previous …

Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning

Y Tang, A Qu, Z Wang, D Zhuang, Z Wu, W Ma… - arXiv preprint arXiv …, 2024 - arxiv.org

Vision language models (VLMs) have demonstrated impressive performance across a wide
range of downstream tasks. However, their proficiency in spatial reasoning remains limited …

Navigate Complex Physical Worlds via Geometrically Constrained LLM

Y Huang, W Ye, L Li, J Zhao - arXiv preprint arXiv:2410.17529, 2024 - arxiv.org

This study investigates the potential of Large Language Models (LLMs) for reconstructing
and constructing the physical world solely based on textual knowledge. It explores the …
