PIVOT: Iterative visual prompting elicits actionable knowledge for VLMs

S Nasiriany, F Xia, W Yu, T Xiao, J Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have shown impressive capabilities across a variety of
tasks, from logical reasoning to visual understanding. This opens the door to richer …

CogCoM: Train large vision-language models diving into details through chain of manipulations

J Qi, M Ding, W Wang, Y Bai, Q Lv, W Hong… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to
extensive training in aligning visual instructions to answers. However, this conclusive …

MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - arXiv preprint arXiv:2403.03174, 2024 - arxiv.org
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Tackling vision language tasks through learning inner monologues

D Yang, K Chen, J Rao, X Guo, Y Zhang… - Proceedings of the …, 2024 - ojs.aaai.org
Visual language tasks such as Visual Question Answering (VQA) or Visual Entailment (VE)
require AI models to comprehend and reason with both visual and textual content. Driven by …

Prismatic VLMs: Investigating the design space of visually-conditioned language models

S Karamcheti, S Nair, A Balakrishna, P Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Visually-conditioned language models (VLMs) have seen growing adoption in applications
such as visual dialogue, scene understanding, and robotic task planning; adoption that has …

RePLan: Robotic replanning with perception and language models

M Skreta, Z Zhou, JL Yuan, K Darvish… - arXiv preprint arXiv …, 2024 - arxiv.org
Advancements in large language models (LLMs) have demonstrated their potential in
facilitating high-level reasoning, logical reasoning and robotics planning. Recently, LLMs …

How Far Are We from Intelligent Visual Deductive Reasoning?

Y Zhang, H Bai, R Zhang, J Gu, S Zhai… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated incredible
strides on diverse vision language tasks. We dig into vision-based deductive reasoning, a …

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

P Anderson, Q Wu, D Teney, J Bruce… - Proceedings of the …, 2018 - openaccess.thecvf.com
A robot that can carry out a natural-language instruction has been a dream since before the
Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot …

Vision-language foundation models as effective robot imitators

X Li, M Liu, H Zhang, C Yu, J Xu, H Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent progress in vision language foundation models has shown their ability to understand
multimodal data and resolve complicated vision language tasks, including robotics …

RT-2: Vision-language-action models transfer web knowledge to robotic control

B Zitkovich, T Yu, S Xu, P Xu, T Xiao… - … on Robot Learning, 2023 - proceedings.mlr.press
We study how vision-language models trained on Internet-scale data can be incorporated
directly into end-to-end robotic control to boost generalization and enable emergent …