MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - arXiv preprint arXiv:2403.03174, 2024 - arxiv.org
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Vision-language foundation models as effective robot imitators

X Li, M Liu, H Zhang, C Yu, J Xu, H Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent progress in vision language foundation models has shown their ability to understand
multimodal data and resolve complicated vision language tasks, including robotics …

Open-world object manipulation using pre-trained vision-language models

A Stone, T Xiao, Y Lu, K Gopalakrishnan… - arXiv preprint arXiv …, 2023 - arxiv.org
For robots to follow instructions from people, they must be able to connect the rich semantic
information in human vocabulary, e.g., "can you get me the pink stuffed whale?" to their …

PIVOT: Iterative visual prompting elicits actionable knowledge for VLMs

S Nasiriany, F Xia, W Yu, T Xiao, J Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have shown impressive capabilities across a variety of
tasks, from logical reasoning to visual understanding. This opens the door to richer …

RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

F Liu, F Yan, L Zheng, C Feng, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel
paradigm, aiming to enhance the model's ability to generalize to new objects and …

Physically grounded vision-language models for robotic manipulation

J Gao, B Sarkar, F Xia, T Xiao, J Wu, B Ichter… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in vision-language models (VLMs) have led to improved performance on
tasks such as visual question answering and image captioning. Consequently, these models …

ManipVQA: Injecting robotic affordance and physically grounded information into multi-modal large language models

S Huang, I Ponomarenko, Z Jiang, X Li, X Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
The integration of Multimodal Large Language Models (MLLMs) with robotic systems has
significantly enhanced the ability of robots to interpret and act upon natural language …

VLMbench: A compositional benchmark for vision-and-language manipulation

K Zheng, X Chen, OC Jenkins… - Advances in Neural …, 2022 - proceedings.neurips.cc
Benefiting from language flexibility and compositionality, humans naturally tend to use
language to command an embodied agent for complex tasks such as navigation and object …

GPT-4V(ision) for robotics: Multimodal task planning from human demonstration

N Wake, A Kanehira, K Sasabuchi, J Takamatsu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision),
by integrating observations of human actions to facilitate robotic manipulation. This …

Towards generalizable zero-shot manipulation via translating human interaction plans

H Bharadhwaj, A Gupta, V Kumar, S Tulsiani - arXiv preprint arXiv …, 2023 - arxiv.org
We pursue the goal of developing robots that can interact zero-shot with generic unseen
objects via a diverse repertoire of manipulation skills and show how passive human videos …