Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models

X Lei, Z Yang, X Chen, P Li, Y Liu - arXiv preprint arXiv:2402.12058, 2024 - arxiv.org
State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional
capabilities in vision-language tasks. Despite their advanced functionalities, the …

Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

Explore until Confident: Efficient Exploration for Embodied Question Answering

AZ Ren, J Clark, A Dixit, M Itkina, A Majumdar… - arXiv preprint arXiv …, 2024 - arxiv.org
We consider the problem of Embodied Question Answering (EQA), which refers to settings
where an embodied agent such as a robot needs to actively explore an environment to …

Multi-Object Hallucination in Vision-Language Models

X Chen, Z Ma, X Zhang, S Xu, S Qian, J Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision language models (LVLMs) often suffer from object hallucination, producing
objects not present in the given images. While current benchmarks for object hallucination …

Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models

AS Chen, AM Lessing, A Tang, G Chada… - arXiv preprint arXiv …, 2024 - arxiv.org
Legged robots are physically capable of navigating a wide variety of environments and
overcoming a wide range of obstructions. For example, in a search and rescue mission, a …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

Generative Image as Action Models

M Shridhar, YL Lo, S James - arXiv preprint arXiv:2407.07875, 2024 - arxiv.org
Image-generation diffusion models have been fine-tuned to unlock new capabilities such as
image-editing and novel view synthesis. Can we similarly unlock image-generation models …

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

AC Cheng, H Yin, Y Fu, Q Guo, R Yang, J Kautz… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision
and language tasks. However, their ability to reason about spatial arrangements remains …

RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

Z Chen, Z Shi, X Lu, L He, S Qian, HS Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
The ultimate goal of robotic learning is to acquire a comprehensive and generalizable
robotic system capable of performing both seen skills within the training distribution and …

RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

W Yuan, J Duan, V Blukis, W Pumacay… - arXiv preprint arXiv …, 2024 - arxiv.org
From rearranging objects on a table to putting groceries onto shelves, robots must plan
precise action points to perform tasks accurately and reliably. In spite of the recent adoption …