MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - arXiv preprint arXiv:2403.03174, 2024 - arxiv.org
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Vision-language foundation models as effective robot imitators

X Li, M Liu, H Zhang, C Yu, J Xu, H Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent progress in vision language foundation models has shown their ability to understand
multimodal data and resolve complicated vision language tasks, including robotics …

Open-world object manipulation using pre-trained vision-language models

A Stone, T Xiao, Y Lu, K Gopalakrishnan… - arXiv preprint arXiv …, 2023 - arxiv.org
For robots to follow instructions from people, they must be able to connect the rich semantic
information in human vocabulary, e.g., "can you get me the pink stuffed whale?" to their …

PIVOT: Iterative visual prompting elicits actionable knowledge for VLMs

S Nasiriany, F Xia, W Yu, T Xiao, J Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have shown impressive capabilities across a variety of
tasks, from logical reasoning to visual understanding. This opens the door to richer …

RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

F Liu, F Yan, L Zheng, C Feng, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel
paradigm, aiming to enhance the model's ability to generalize to new objects and …

Physically grounded vision-language models for robotic manipulation

J Gao, B Sarkar, F Xia, T Xiao, J Wu, B Ichter… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in vision-language models (VLMs) have led to improved performance on
tasks such as visual question answering and image captioning. Consequently, these models …

ManipVQA: Injecting robotic affordance and physically grounded information into multi-modal large language models

S Huang, I Ponomarenko, Z Jiang, X Li, X Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
The integration of Multimodal Large Language Models (MLLMs) with robotic systems has
significantly enhanced the ability of robots to interpret and act upon natural language …

VLMbench: A compositional benchmark for vision-and-language manipulation

K Zheng, X Chen, OC Jenkins… - Advances in Neural …, 2022 - proceedings.neurips.cc
Benefiting from language flexibility and compositionality, humans naturally tend to use
language to command an embodied agent for complex tasks such as navigation and object …

GPT-4V(ision) for robotics: Multimodal task planning from human demonstration

N Wake, A Kanehira, K Sasabuchi, J Takamatsu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision),
by integrating observations of human actions to facilitate robotic manipulation. This …

Towards generalizable zero-shot manipulation via translating human interaction plans

H Bharadhwaj, A Gupta, V Kumar, S Tulsiani - arXiv preprint arXiv …, 2023 - arxiv.org
We pursue the goal of developing robots that can interact zero-shot with generic unseen
objects via a diverse repertoire of manipulation skills and show how passive human videos …