PIVOT: Iterative visual prompting elicits actionable knowledge for VLMs

S Nasiriany, F Xia, W Yu, T Xiao, J Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have shown impressive capabilities across a variety of
tasks, from logical reasoning to visual understanding. This opens the door to richer …

CogCoM: Train large vision-language models diving into details through chain of manipulations

J Qi, M Ding, W Wang, Y Bai, Q Lv, W Hong… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to
extensive training in aligning visual instructions to answers. However, this conclusive …

MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - arXiv preprint arXiv:2403.03174, 2024 - arxiv.org
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Tackling vision language tasks through learning inner monologues

D Yang, K Chen, J Rao, X Guo, Y Zhang… - Proceedings of the …, 2024 - ojs.aaai.org
Visual language tasks such as Visual Question Answering (VQA) or Visual Entailment (VE)
require AI models to comprehend and reason with both visual and textual content. Driven by …

Prismatic VLMs: Investigating the design space of visually-conditioned language models

S Karamcheti, S Nair, A Balakrishna, P Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Visually-conditioned language models (VLMs) have seen growing adoption in applications
such as visual dialogue, scene understanding, and robotic task planning; adoption that has …

RePLan: Robotic replanning with perception and language models

M Skreta, Z Zhou, JL Yuan, K Darvish… - arXiv preprint arXiv …, 2024 - arxiv.org
Advancements in large language models (LLMs) have demonstrated their potential in
facilitating high-level reasoning, logical reasoning and robotics planning. Recently, LLMs …

How Far Are We from Intelligent Visual Deductive Reasoning?

Y Zhang, H Bai, R Zhang, J Gu, S Zhai… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated incredible
strides on diverse vision language tasks. We dig into vision-based deductive reasoning, a …

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

P Anderson, Q Wu, D Teney, J Bruce… - Proceedings of the …, 2018 - openaccess.thecvf.com
A robot that can carry out a natural-language instruction has been a dream since before the
Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot …

Vision-language foundation models as effective robot imitators

X Li, M Liu, H Zhang, C Yu, J Xu, H Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent progress in vision language foundation models has shown their ability to understand
multimodal data and resolve complicated vision language tasks, including robotics …

RT-2: Vision-language-action models transfer web knowledge to robotic control

B Zitkovich, T Yu, S Xu, P Xu, T Xiao… - … on Robot Learning, 2023 - proceedings.mlr.press
We study how vision-language models trained on Internet-scale data can be incorporated
directly into end-to-end robotic control to boost generalization and enable emergent …