X Li, M Liu, H Zhang, C Yu, J Xu, H Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent progress in vision-language foundation models has shown their ability to understand multimodal data and solve complex vision-language tasks, including robotics …
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g., "can you get me the pink stuffed whale?", to their …
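This kind of open-vocabulary grounding is often implemented by scoring visual candidates against the language query with a pretrained VLM. Below is a minimal sketch, assuming CLIP via the Hugging Face transformers library and object crops supplied by some upstream detector; the model checkpoint and the crop source are illustrative assumptions, not details from the snippet above.

```python
# Hedged sketch: rank candidate object crops against a natural-language
# request using CLIP image-text similarity. Checkpoint and crop source
# are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_match(crops: list[Image.Image], query: str) -> int:
    """Return the index of the crop whose CLIP embedding best matches the query."""
    inputs = processor(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_crops, 1): similarity of each crop to the query
    return int(outputs.logits_per_image.squeeze(-1).argmax())

# e.g., crops produced by an upstream detector or segmenter:
# idx = best_match(crops, "the pink stuffed whale")
```

The highest-scoring crop would then be handed to a downstream grasping or manipulation policy.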
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer …
F Liu, F Yan, L Zheng, C Feng, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm aimed at enhancing the model's ability to generalize to new objects and …
Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models …
The integration of Multimodal Large Language Models (MLLMs) with robotic systems has significantly enhanced the ability of robots to interpret and act upon natural language …
Benefiting from the flexibility and compositionality of language, humans naturally use it to command an embodied agent for complex tasks such as navigation and object …
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), by integrating observations of human actions to facilitate robotic manipulation. This …
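As a rough illustration of such a pipeline, the sketch below sends frames of a human demonstration to a GPT-4V-class model and asks for a step-by-step action plan. The prompt wording, model identifier, and output format are assumptions for illustration, not the paper's actual design.

```python
# Hedged sketch: feed demonstration frames to a GPT-4V-class model via the
# OpenAI Python SDK and request a robot-executable plan. Model id and prompt
# are assumed, not taken from the paper.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_frame(path: str) -> dict:
    """Package one demonstration frame as a base64 image message part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def plan_from_demo(frame_paths: list[str]) -> str:
    content = [{"type": "text",
                "text": "These frames show a human performing a manipulation task. "
                        "Summarize the task as a numbered sequence of robot actions."}]
    content += [encode_frame(p) for p in frame_paths]
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model id; substitute the deployment you use
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```

The returned text plan would typically be parsed into primitive skills before execution on the robot.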
We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills and show how passive human videos …