AlphaBlock: Embodied finetuning for vision-language reasoning in robot manipulation

C Jin, W Tan, J Yang, B Liu, R Song, L Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose a novel framework for learning high-level cognitive capabilities in robot
manipulation tasks, such as making a smiley face using building blocks. These tasks often …

Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning

J Li, Q Gao, M Johnston, X Gao, X He… - arXiv preprint arXiv …, 2023 - arxiv.org
Prompt-based learning has been demonstrated as a compelling paradigm contributing to
large language models' (LLMs) tremendous success. Inspired by their success in language …

Closed-loop open-vocabulary mobile manipulation with GPT-4V

P Zhi, Z Zhang, M Han, Z Zhang, Z Li, Z Jiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous robot navigation and manipulation in open environments require reasoning
and replanning with closed-loop feedback. We present COME-robot, the first closed-loop …

Physically grounded vision-language models for robotic manipulation

J Gao, B Sarkar, F Xia, T Xiao, J Wu, B Ichter… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in vision-language models (VLMs) have led to improved performance on
tasks such as visual question answering and image captioning. Consequently, these models …

NaturalVLM: Leveraging fine-grained natural language for affordance-guided visual manipulation

R Xu, Y Shen, X Li, R Wu, H Dong - arXiv preprint arXiv:2403.08355, 2024 - arxiv.org
Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects
based on human language instructions is a pivotal challenge. Prior research has …

Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts

F Ni, J Hao, S Wu, L Kou, J Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Robotics agents often struggle to understand and follow multi-modal prompts in complex
manipulation scenes, which are difficult to describe sufficiently and accurately by …

Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

M Zhu, Y Zhu, J Li, J Wen, Z Xu, Z Che, C Shen… - arXiv preprint arXiv …, 2024 - arxiv.org
Language-conditioned robotic manipulation aims to translate natural language
instructions into executable actions, from simple pick-and-place to tasks requiring intent …

Learning neuro-symbolic programs for language guided robot manipulation

K Namasivayam, H Singh, V Bindal… - … on Robotics and …, 2023 - ieeexplore.ieee.org
Given a natural language instruction and an input scene, our goal is to train a model to
output a manipulation program that can be executed by the robot. Prior approaches for this …

Spatial-language attention policies for efficient robot learning

P Parashar, V Jain, X Zhang, J Vakil, S Powers… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite great strides in language-guided manipulation, existing work has been constrained
to table-top settings. Table-tops allow for perfect and consistent camera angles, properties …

VIMA: Robot manipulation with multimodal prompts

Y Jiang, A Gupta, Z Zhang, G Wang, Y Dou, Y Chen… - 2023 - openreview.net
Prompt-based learning has emerged as a successful paradigm in natural language
processing, where a single general-purpose language model can be instructed to perform …