RoboVQA: Multimodal long-horizon reasoning for robotics

On the prospects of incorporating large language models (LLMs) in automated planning and scheduling (APS)

V Pallagani, BC Muppasani, K Roy, F Fabiano… - Proceedings of the …, 2024 - ojs.aaai.org

Automated Planning and Scheduling is among the growing areas in Artificial
Intelligence (AI) where mention of LLMs has gained popularity. Based on a comprehensive …

Cited by 18 · Related articles · All 5 versions

[PDF] thecvf.com

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Cited by 45 · Related articles · All 3 versions

[PDF] thecvf.com

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

B Chen, Z Xu, S Kirmani, B Ichter… - Proceedings of the …, 2024 - openaccess.thecvf.com

Understanding and reasoning about spatial relationships is crucial for Visual Question
Answering (VQA) and robotics. Vision Language Models (VLMs) have shown impressive …

Cited by 41 · Related articles · All 5 versions

[PDF] thecvf.com

OpenEQA: Embodied question answering in the era of foundation models

A Majumdar, A Ajay, X Zhang, P Putta… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present a modern formulation of Embodied Question Answering (EQA) as the task of
understanding an environment well enough to answer questions about it in natural …

Cited by 23 · Related articles · All 2 versions

[PDF] arxiv.org

Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

Cited by 12 · Related articles · All 2 versions

[PDF] arxiv.org

RT-H: Action hierarchies using language

S Belkhale, T Ding, T Xiao, P Sermanet… - arXiv preprint arXiv …, 2024 - arxiv.org

Language provides a way to break down complex concepts into digestible pieces. Recent
works in robot imitation learning use language-conditioned policies that predict actions …

Cited by 13 · Related articles · All 2 versions

[PDF] arxiv.org

3D-VLA: A 3D vision-language-action generative world model

H Zhen, X Qiu, P Chen, J Yang, X Yan, Y Du… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the
broader realm of the 3D physical world. Furthermore, they perform action prediction by …

Cited by 6 · Related articles · All 3 versions

[PDF] arxiv.org

Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

S Singh, G Pavlakos, D Stamoulis - arXiv preprint arXiv:2405.18831, 2024 - arxiv.org

As interest in "reformulating" the 3D Visual Question Answering (VQA) problem in the
context of foundation models grows, it is imperative to assess how these new paradigms …

Cited by 1 · Related articles · All 3 versions

[PDF] arxiv.org

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

H Zhang, Y Wang, Y Tang, Y Liu, J Feng, J Dai… - arXiv preprint arXiv …, 2024 - arxiv.org

Benefiting from the advancements in large language models and cross-modal alignment,
existing multi-modal video understanding methods have achieved prominent performance in …

Cited by 1 · Related articles · All 2 versions

[PDF] arxiv.org

Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org

This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

Cited by 1 · Related articles · All 2 versions
