Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

P Li, T Liu, Y Li, M Han, H Geng, S Wang, Y Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous robotic systems capable of learning novel manipulation tasks are poised to
transform industries from manufacturing to service automation. However, modern methods …

Real-World Robot Applications of Foundation Models: A Review

K Kawaharazuka, T Matsushima… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models

D Guo, Y Xiang, S Zhao, X Zhu, M Tomizuka… - arXiv preprint arXiv …, 2024 - arxiv.org
Robotic grasping is a fundamental aspect of robot functionality, defining how robots interact
with objects. Despite substantial progress, its generalizability to counter-intuitive or long …

What Foundation Models can Bring for Robot Learning in Manipulation: A Survey

D Li, Y Jin, H Yu, J Shi, X Hao, P Hao, H Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The realization of universal robots is an ultimate goal of researchers. However, a key hurdle
in achieving this goal lies in the robots' ability to manipulate objects in their unstructured …

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

TW Ke, N Gkanatsios, K Fragkiadaki - arXiv preprint arXiv:2402.10885, 2024 - arxiv.org
We marry diffusion policies and 3D scene representations for robot manipulation. Diffusion
policies learn the action distribution conditioned on the robot and environment state using …
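The snippet above notes that diffusion policies learn the action distribution conditioned on the robot and environment state. As a rough illustration of that general idea only (not the 3D Diffuser Actor implementation, and without its 3D scene representation), a minimal PyTorch sketch of a noise-prediction training step conditioned on a flat state vector might look as follows; the dimensions, network, and noise schedule are all illustrative assumptions.

# Minimal, hypothetical sketch of a diffusion-policy training step: an MLP
# predicts the noise added to an action, conditioned on the state and the
# diffusion timestep. Not the paper's architecture; dimensions are made up.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, T = 16, 7, 100  # illustrative sizes / diffusion steps

class NoisePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM + 1, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, state, noisy_action, t):
        # Condition on the state, the noised action, and a normalized timestep.
        x = torch.cat([state, noisy_action, t.float().unsqueeze(-1) / T], dim=-1)
        return self.net(x)

model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def train_step(state, action):
    """One denoising-diffusion training step on a batch of (state, action)."""
    t = torch.randint(0, T, (action.shape[0],))
    noise = torch.randn_like(action)
    ab = alpha_bar[t].unsqueeze(-1)
    noisy_action = ab.sqrt() * action + (1 - ab).sqrt() * noise
    loss = nn.functional.mse_loss(model(state, noisy_action, t), noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Example usage with random data:
# train_step(torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM))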

Understanding Long Videos in One Multimodal Language Model Pass

K Ranasinghe, X Li, K Kahatapitiya… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs), known to contain a strong awareness of world knowledge,
have allowed recent approaches to achieve excellent performance on Long-Video …

" Task Success" is not Enough: Investigating the Use of Video-Language Models as Behavior Critics for Catching Undesirable Agent Behaviors

L Guan, Y Zhou, D Liu, Y Zha, HB Amor… - arXiv preprint arXiv …, 2024 - arxiv.org
Large-scale generative models are shown to be useful for sampling meaningful candidate
solutions, yet they often overlook task constraints and user preferences. Their full power is …

RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model

J Yuan, S Sun, D Omeiza, B Zhao, P Newman… - arXiv preprint arXiv …, 2024 - arxiv.org
Robots powered by 'blackbox' models need to provide human-understandable explanations
which we can trust. Hence, explainability plays a critical role in trustworthy autonomous …
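The title names retrieval-augmented in-context learning; as a generic, hypothetical sketch of that pattern (not the RAG-Driver system or its API), one might retrieve the most similar past driving scenes by embedding similarity and prepend them as in-context examples to the query. Here embed_scene and the memory entries are placeholders invented for illustration.

# Hypothetical sketch of retrieval-augmented in-context prompting. embed_scene
# is a placeholder (random, hash-seeded vectors), not a real embedding model.
import numpy as np

def embed_scene(scene_description: str) -> np.ndarray:
    """Placeholder: return a unit vector standing in for a scene embedding."""
    rng = np.random.default_rng(abs(hash(scene_description)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def retrieve(query: str, memory: list[dict], k: int = 2) -> list[dict]:
    """Return the k memory entries whose scenes are most similar to the query."""
    q = embed_scene(query)
    return sorted(memory, key=lambda m: -float(q @ embed_scene(m["scene"])))[:k]

def build_prompt(query: str, memory: list[dict]) -> str:
    """Assemble an in-context prompt from retrieved (scene, explanation) pairs."""
    shots = "\n".join(f"Scene: {m['scene']}\nExplanation: {m['explanation']}"
                      for m in retrieve(query, memory))
    return f"{shots}\nScene: {query}\nExplanation:"

memory = [
    {"scene": "pedestrian steps onto crossing", "explanation": "Braking to yield."},
    {"scene": "traffic light turns amber", "explanation": "Slowing before the stop line."},
]
print(build_prompt("cyclist merges into the lane ahead", memory))
# The resulting prompt would then be passed to a multimodal LLM (not shown).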

General Flow as Foundation Affordance for Scalable Robot Learning

C Yuan, C Wen, T Zhang, Y Gao - arXiv preprint arXiv:2401.11439, 2024 - arxiv.org
We address the challenge of acquiring real-world manipulation skills with a scalable
framework. Inspired by the success of large-scale auto-regressive prediction in Large …

A Survey of Robotic Language Grounding: Tradeoffs Between Symbols and Embeddings

V Cohen, JX Liu, R Mooney, S Tellex… - arXiv preprint arXiv …, 2024 - arxiv.org
With large language models, robots can understand language more flexibly and more
capably than ever before. This survey reviews recent literature and situates it into a …