On the prospects of incorporating large language models (LLMs) in automated planning and scheduling (APS)

V Pallagani, BC Muppasani, K Roy, F Fabiano… - Proceedings of the …, 2024 - ojs.aaai.org
Automated Planning and Scheduling is among the growing areas in Artificial
Intelligence (AI) where mention of LLMs has gained popularity. Based on a comprehensive …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

B Chen, Z Xu, S Kirmani, B Ichter… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding and reasoning about spatial relationships is crucial for Visual Question
Answering (VQA) and robotics. Vision-Language Models (VLMs) have shown impressive …

OpenEQA: Embodied question answering in the era of foundation models

A Majumdar, A Ajay, X Zhang, P Putta… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present a modern formulation of Embodied Question Answering (EQA) as the task of
understanding an environment well enough to answer questions about it in natural …

Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

RT-H: Action hierarchies using language

S Belkhale, T Ding, T Xiao, P Sermanet… - arXiv preprint arXiv …, 2024 - arxiv.org
Language provides a way to break down complex concepts into digestible pieces. Recent
works in robot imitation learning use language-conditioned policies that predict actions …

3D-VLA: A 3D vision-language-action generative world model

H Zhen, X Qiu, P Chen, J Yang, X Yan, Y Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the
broader realm of the 3D physical world. Furthermore, they perform action prediction by …

Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

S Singh, G Pavlakos, D Stamoulis - arXiv preprint arXiv:2405.18831, 2024 - arxiv.org
As interest in" reformulating" the 3D Visual Question Answering (VQA) problem in the
context of foundation models grows, it is imperative to assess how these new paradigms …

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

H Zhang, Y Wang, Y Tang, Y Liu, J Feng, J Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
Benefiting from the advancements in large language models and cross-modal alignment,
existing multi-modal video understanding methods have achieved prominent performance in …

Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …