Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis
Recent developments in foundation models, such as Large Language Models (LLMs) and
Vision-Language Models (VLMs), trained on extensive data, facilitate flexible application across …

Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation

Z Fu, TZ Zhao, C Finn - arXiv preprint arXiv:2401.02117, 2024 - arxiv.org
Imitation learning from human demonstrations has shown impressive performance in
robotics. However, most results focus on table-top manipulation, lacking the mobility and …

MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - First Workshop on Vision …, 2024 - openreview.net
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

R Doshi, H Walke, O Mees, S Dasari… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern machine learning systems rely on large datasets to attain broad generalization, and
this often poses a challenge in robot learning, where each robotic platform and task might …

GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation

CL Cheang, G Chen, Y Jing, T Kong, H Li, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable
robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture …

General-purpose foundation models for increased autonomy in robot-assisted surgery

S Schmidgall, JW Kim, A Kuntz, AE Ghazi… - Nature Machine …, 2024 - nature.com
The dominant paradigm for end-to-end robot learning focuses on optimizing task-specific
objectives that solve a single robotic problem such as picking up an object or reaching a …

RAM: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation

Y Kuang, J Ye, H Geng, J Mao, C Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation,
dubbed RAM, featuring generalizability across various objects, environments, and …

Surgical Robot Transformer (SRT): Imitation learning for surgical tasks

JW Kim, TZ Zhao, S Schmidgall, A Deguet… - arXiv preprint arXiv …, 2024 - arxiv.org
We explore whether surgical manipulation tasks can be learned on the da Vinci robot via
imitation learning. However, the da Vinci system presents unique challenges which hinder …

TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation

J Wen, Y Zhu, J Li, M Zhu, K Wu, Z Xu, N Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor
control and instruction comprehension through end-to-end learning processes. However …

SpatialBot: Precise spatial understanding with vision language models

W Cai, I Ponomarenko, J Yuan, X Li, W Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) have achieved impressive performance in 2D image
understanding; however, they still struggle with spatial understanding, which is the …