GPT-4V(ision) for robotics: Multimodal task planning from human demonstration

N Wake, A Kanehira, K Sasabuchi, J Takamatsu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision),
by integrating observations of human actions to facilitate robotic manipulation. This …

AlphaBlock: Embodied finetuning for vision-language reasoning in robot manipulation

C Jin, W Tan, J Yang, B Liu, R Song, L Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose a novel framework for learning high-level cognitive capabilities in robot
manipulation tasks, such as making a smiley face using building blocks. These tasks often …

MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - arXiv preprint arXiv:2403.03174, 2024 - arxiv.org
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Gesture-informed robot assistance via foundation models

LH Lin, Y Cui, Y Hao, F Xia, D Sadigh - 7th Annual Conference on …, 2023 - openreview.net
Gestures serve as a fundamental and significant mode of non-verbal communication among
humans. Deictic gestures (such as pointing towards an object), in particular, offer valuable …

Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning

Y Hu, F Lin, T Zhang, L Yi, Y Gao - arXiv preprint arXiv:2311.17842, 2023 - arxiv.org
In this study, we are interested in imbuing robots with the capability of physically-grounded
task planning. Recent advancements have shown that large language models (LLMs) …

Open-world object manipulation using pre-trained vision-language models

A Stone, T Xiao, Y Lu, K Gopalakrishnan… - arXiv preprint arXiv …, 2023 - arxiv.org
For robots to follow instructions from people, they must be able to connect the rich semantic
information in human vocabulary, e.g., "can you get me the pink stuffed whale?", to their …

Embodied task planning with large language models

Z Wu, Z Wang, X Xu, J Lu, H Yan - arXiv preprint arXiv:2307.01848, 2023 - arxiv.org
Equipping embodied agents with commonsense is important for robots to successfully
complete complex human instructions in general environments. Recent large language …

Physically grounded vision-language models for robotic manipulation

J Gao, B Sarkar, F Xia, T Xiao, J Wu, B Ichter… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in vision-language models (VLMs) have led to improved performance on
tasks such as visual question answering and image captioning. Consequently, these models …

Structured world models from human videos

R Mendonca, S Bahl, D Pathak - arXiv preprint arXiv:2308.10901, 2023 - arxiv.org
We tackle the problem of learning complex, general behaviors directly in the real world. We
propose an approach for robots to efficiently learn manipulation skills using only a handful of …

Assistive tele-op: Leveraging transformers to collect robotic task demonstrations

HM Clever, A Handa, H Mazhar, K Parker… - arXiv preprint arXiv …, 2021 - arxiv.org
Sharing autonomy between robots and human operators could facilitate data collection of
robotic task demonstrations to continuously improve learned models. Yet, the means to …