Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

Explore until Confident: Efficient Exploration for Embodied Question Answering

AZ Ren, J Clark, A Dixit, M Itkina, A Majumdar… - arXiv preprint arXiv …, 2024 - arxiv.org
We consider the problem of Embodied Question Answering (EQA), which refers to settings
where an embodied agent such as a robot needs to actively explore an environment to …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

AC Cheng, H Yin, Y Fu, Q Guo, R Yang, J Kautz… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision
and language tasks. However, their ability to reason about spatial arrangements remains …

RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

Z Chen, Z Shi, X Lu, L He, S Qian, HS Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
The ultimate goal of robotic learning is to acquire a comprehensive and generalizable
robotic system capable of performing both seen skills within the training distribution and …

Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models

X Lei, Z Yang, X Chen, P Li, Y Liu - arXiv preprint arXiv:2402.12058, 2024 - arxiv.org
State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional
capabilities in vision-language tasks. Despite their advanced functionalities, the …

CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments

AJ Sathyamoorthy, K Weerakoon, M Elnoor… - arXiv preprint arXiv …, 2024 - arxiv.org
We present CoNVOI, a novel method for autonomous robot navigation in real-world indoor
and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two …

Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?

YH Liao, R Mahmood, S Fidler, D Acuna - arXiv preprint arXiv:2404.06510, 2024 - arxiv.org
Enhancing semantic grounding abilities in Vision-Language Models (VLMs) often involves
collecting domain-specific training data, refining the network architectures, or modifying the …

MATRIX: Multi-Agent Trajectory Generation with Diverse Contexts

Z Xu, R Zhou, Y Yin, H Gao, M Tomizuka… - arXiv preprint arXiv …, 2024 - arxiv.org
Data-driven methods have great advantages in modeling complicated human behavioral
dynamics and handling many human-robot interaction applications. However, collecting …

RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model

H Zhou, T Ji, J Zhang, F Sun, H Xu - arXiv preprint arXiv:2406.10157, 2024 - arxiv.org
Minigolf, a game with countless court layouts and complex ball motion, constitutes a
compelling real-world testbed for the study of embodied intelligence. As it not only …