Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

Explore until Confident: Efficient Exploration for Embodied Question Answering

AZ Ren, J Clark, A Dixit, M Itkina, A Majumdar… - arXiv preprint arXiv …, 2024 - arxiv.org
We consider the problem of Embodied Question Answering (EQA), which refers to settings
where an embodied agent such as a robot needs to actively explore an environment to …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

AC Cheng, H Yin, Y Fu, Q Guo, R Yang, J Kautz… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision
and language tasks. However, their ability to reason about spatial arrangements remains …

RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

Z Chen, Z Shi, X Lu, L He, S Qian, HS Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
The ultimate goal of robotic learning is to acquire a comprehensive and generalizable
robotic system capable of performing both seen skills within the training distribution and …

Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models

X Lei, Z Yang, X Chen, P Li, Y Liu - arXiv preprint arXiv:2402.12058, 2024 - arxiv.org
State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional
capabilities in vision-language tasks. Despite their advanced functionalities, the …

CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments

AJ Sathyamoorthy, K Weerakoon, M Elnoor… - arXiv preprint arXiv …, 2024 - arxiv.org
We present CoNVOI, a novel method for autonomous robot navigation in real-world indoor
and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two …

Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?

YH Liao, R Mahmood, S Fidler, D Acuna - arXiv preprint arXiv:2404.06510, 2024 - arxiv.org
Enhancing semantic grounding abilities in Vision-Language Models (VLMs) often involves
collecting domain-specific training data, refining the network architectures, or modifying the …

MATRIX: Multi-Agent Trajectory Generation with Diverse Contexts

Z Xu, R Zhou, Y Yin, H Gao, M Tomizuka… - arXiv preprint arXiv …, 2024 - arxiv.org
Data-driven methods have great advantages in modeling complicated human behavioral
dynamics and handling many human-robot interaction applications. However, collecting …

RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model

H Zhou, T Ji, J Zhang, F Sun, H Xu - arXiv preprint arXiv:2406.10157, 2024 - arxiv.org
Minigolf, a game with countless court layouts and complex ball motion, constitutes a
compelling real-world testbed for the study of embodied intelligence. As it not only …