Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models

X Lei, Z Yang, X Chen, P Li, Y Liu - arXiv preprint arXiv:2402.12058, 2024 - arxiv.org
State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional
capabilities in vision-language tasks. Despite their advanced functionalities, the …

Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

Explore until Confident: Efficient Exploration for Embodied Question Answering

AZ Ren, J Clark, A Dixit, M Itkina, A Majumdar… - arXiv preprint arXiv …, 2024 - arxiv.org
We consider the problem of Embodied Question Answering (EQA), which refers to settings
where an embodied agent such as a robot needs to actively explore an environment to …

Multi-Object Hallucination in Vision-Language Models

X Chen, Z Ma, X Zhang, S Xu, S Qian, J Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision language models (LVLMs) often suffer from object hallucination, producing
objects not present in the given images. While current benchmarks for object hallucination …

Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models

AS Chen, AM Lessing, A Tang, G Chada… - arXiv preprint arXiv …, 2024 - arxiv.org
Legged robots are physically capable of navigating a wide variety of environments and
overcoming a wide range of obstructions. For example, in a search and rescue mission, a …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

Generative Image as Action Models

M Shridhar, YL Lo, S James - arXiv preprint arXiv:2407.07875, 2024 - arxiv.org
Image-generation diffusion models have been fine-tuned to unlock new capabilities such as
image-editing and novel view synthesis. Can we similarly unlock image-generation models …

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

AC Cheng, H Yin, Y Fu, Q Guo, R Yang, J Kautz… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision
and language tasks. However, their ability to reason about spatial arrangements remains …

RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

Z Chen, Z Shi, X Lu, L He, S Qian, HS Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
The ultimate goal of robotic learning is to acquire a comprehensive and generalizable
robotic system capable of performing both seen skills within the training distribution and …

RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

W Yuan, J Duan, V Blukis, W Pumacay… - arXiv preprint arXiv …, 2024 - arxiv.org
From rearranging objects on a table to putting groceries onto shelves, robots must plan
precise action points to perform tasks accurately and reliably. In spite of the recent adoption …