LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

X Li, C Mata, J Park, K Kahatapitiya, YS Jang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …
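
The snippet frames robot control as a VLM conversation: the current observation and instruction go in as a visual-textual prompt, and the policy decision comes back as text. Below is a minimal, hypothetical sketch of that framing; the prompt template, the query_vlm stub, and the move_to(x, y) action format are assumptions for illustration, not LLaRA's actual interface.

```python
import re

def build_prompt(instruction: str) -> str:
    # Hypothetical template: the image carries the scene state,
    # the text carries the task; the VLM replies with an action string.
    return (
        "You are a robot policy. Given the image of the workspace, "
        f"complete the task: '{instruction}'. "
        "Reply with a single action like move_to(x, y)."
    )

def parse_action(reply: str):
    # Parse a text reply such as "move_to(0.42, 0.17)" into a command dict.
    m = re.search(r"move_to\(([-\d.]+),\s*([-\d.]+)\)", reply)
    if m is None:
        return None
    return {"action": "move_to", "x": float(m.group(1)), "y": float(m.group(2))}

# query_vlm(image, prompt) stands in for whatever VLM backend is available;
# it is not part of the paper.
# reply = query_vlm(scene_image, build_prompt("pick up the red block"))
print(parse_action("move_to(0.42, 0.17)"))
```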

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

M Wu, X Cai, J Ji, J Li, O Huang, G Luo, H Fei… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we propose a training-free method to inject visual referring into Multimodal
Large Language Models (MLLMs) through learnable visual token optimization. We observe …
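
One reading of "training-free learnable visual token optimization" is test-time gradient descent on a small latent added to the visual tokens, driven by an attention-based energy that pulls attention toward the referred region while the model weights stay frozen. The toy PyTorch sketch below uses random tensors and an attention-mass objective purely as an assumption about what such an energy might look like, not as the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins: 256 visual tokens, one text query token, 64 channels.
visual_tokens = torch.randn(256, 64)
text_query = torch.randn(1, 64)
region_mask = torch.zeros(256)
region_mask[100:120] = 1.0  # tokens covering the referred region (assumed)

# Training-free: only this small latent is optimized, the model is frozen.
delta = torch.zeros_like(visual_tokens, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for _ in range(100):
    attn = F.softmax(text_query @ (visual_tokens + delta).T / 8.0, dim=-1)
    # Assumed energy: attention mass falling outside the referred region.
    loss = (attn * (1.0 - region_mask)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

attn = F.softmax(text_query @ (visual_tokens + delta).T / 8.0, dim=-1)
print("attention inside region:", float((attn * region_mask).sum()))
```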

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

R Liao, M Erler, H Wang, G Zhai, G Zhang, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
In the video-language domain, recent works leveraging zero-shot Large Language Model-
based reasoning for video understanding have become competitive challengers to previous …

Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning

Y Tang, A Qu, Z Wang, D Zhuang, Z Wu, W Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have demonstrated impressive performance across a wide
range of downstream tasks. However, their proficiency in spatial reasoning remains limited …

Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

B Zhao, LP Dirac, P Varshavskaya - arXiv preprint arXiv:2409.17080, 2024 - arxiv.org
Large vision-language models (VLMs) have become state-of-the-art for many computer
vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But …
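
The adaptation strategy named here, in-context learning, amounts to prepending demonstration image-answer pairs to the query. The sketch below assembles such an interleaved prompt using an OpenAI-style multimodal message list as an assumption; the demonstration URLs and the left/right question are made up for illustration and are not the paper's benchmark.

```python
def icl_messages(demos, query_image_url, question):
    """Interleave (image, answer) demonstrations before the query image.

    `demos` is a list of (image_url, answer) pairs; the content-list
    message schema is an assumed OpenAI-style format.
    """
    content = []
    for image_url, answer in demos:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
        content.append({"type": "text", "text": f"{question} Answer: {answer}"})
    content.append({"type": "image_url", "image_url": {"url": query_image_url}})
    content.append({"type": "text", "text": f"{question} Answer:"})
    return [{"role": "user", "content": content}]

msgs = icl_messages(
    demos=[("https://example.com/demo1.png", "left"),
           ("https://example.com/demo2.png", "right")],
    query_image_url="https://example.com/query.png",
    question="Is the ball to the left or right of the box?",
)
print(len(msgs[0]["content"]))  # 6 items: 2 demos plus 1 query, each image + text
```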

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

J Kim, H Chung, BH Kim - arXiv preprint arXiv:2411.06869, 2024 - arxiv.org
Category-agnostic pose estimation (CAPE) has traditionally relied on support images with
annotated keypoints, a process that is often cumbersome and may fail to fully capture the …

On Erroneous Agreements of CLIP Image Embeddings

S Li, PW Koh, SS Du - arXiv preprint arXiv:2411.05195, 2024 - arxiv.org
Recent research suggests that the failures of Vision-Language Models (VLMs) at visual
reasoning often stem from erroneous agreements--when semantically distinct images are …
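
An "erroneous agreement" as described here can be checked directly: embed two semantically distinct images with CLIP and see whether their image embeddings are nearly identical. The sketch below uses the Hugging Face transformers CLIP interface; the two image paths are placeholders, and the 0.95 threshold is an arbitrary illustration, not a value from the paper.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder paths: two images a human would call semantically distinct.
images = [Image.open("image_a.png"), Image.open("image_b.png")]
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    feats = model.get_image_features(**inputs)

feats = F.normalize(feats, dim=-1)
cosine = float(feats[0] @ feats[1])
print(f"cosine similarity: {cosine:.3f}")
if cosine > 0.95:  # arbitrary threshold for illustration
    print("near-identical embeddings despite distinct content: erroneous agreement")
```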

Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks

G Wardle, T Susnjak - arXiv preprint arXiv:2410.03062, 2024 - arxiv.org
This paper examines how the sequencing of images and text within multi-modal prompts
influences the reasoning performance of large language models (LLMs). We performed …
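
The experimental variable here is simply the order of modalities within the prompt. The short sketch below builds both orderings using an assumed OpenAI-style multimodal message format; swapping the list order is the whole manipulation.

```python
def build_messages(image_url: str, task_text: str, image_first: bool):
    # The only difference between conditions is the order of the two blocks.
    image_block = {"type": "image_url", "image_url": {"url": image_url}}
    text_block = {"type": "text", "text": task_text}
    content = [image_block, text_block] if image_first else [text_block, image_block]
    return [{"role": "user", "content": content}]

for image_first in (True, False):
    msgs = build_messages(
        "https://example.com/figure.png",  # placeholder image
        "Describe the trend shown and explain the likely cause.",
        image_first=image_first,
    )
    order = [block["type"] for block in msgs[0]["content"]]
    print("image-first" if image_first else "text-first", order)
```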

SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

M Nie, D Ding, C Wang, Y Guo, J Han… - The Thirty-eighth …, 2024 - openreview.net
Large language models (LLMs) have demonstrated exceptional capabilities in text
understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to …

Navigate Complex Physical Worlds via Geometrically Constrained LLM

Y Huang, W Ye, L Li, J Zhao - arXiv preprint arXiv:2410.17529, 2024 - arxiv.org
This study investigates the potential of Large Language Models (LLMs) for reconstructing
and constructing the physical world solely based on textual knowledge. It explores the …