Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

LLM-Planner: Few-shot grounded planning for embodied agents with large language models

CH Song, J Wu, C Washington… - Proceedings of the …, 2023 - openaccess.thecvf.com
This study focuses on using large language models (LLMs) as a planner for embodied
agents that can follow natural language instructions to complete complex tasks in a visually …

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

W Huang, P Abbeel, D Pathak… - … conference on machine …, 2022 - proceedings.mlr.press
Can world knowledge learned by large language models (LLMs) be used to act in
interactive environments? In this paper, we investigate the possibility of grounding high-level …

Navigation with large language models: Semantic guesswork as a heuristic for planning

D Shah, MR Equi, B Osiński, F Xia… - … on Robot Learning, 2023 - proceedings.mlr.press
Navigation in unfamiliar environments presents a major challenge for robots: while mapping
and planning techniques can be used to build up a representation of the world, quickly …

Pre-trained language models for interactive decision-making

S Li, X Puig, C Paxton, Y Du, C Wang… - Advances in …, 2022 - proceedings.neurips.cc
Language model (LM) pre-training is useful in many language processing tasks.
But can pre-trained LMs be further leveraged for more general machine learning problems …

Foundation models for decision making: Problems, methods, and opportunities

S Yang, O Nachum, Y Du, J Wei, P Abbeel… - arXiv preprint arXiv …, 2023 - arxiv.org
Foundation models pretrained on diverse data at scale have demonstrated extraordinary
capabilities in a wide range of vision and language tasks. When such models are deployed …

History aware multimodal transformer for vision-and-language navigation

S Chen, PL Guhur, C Schmid… - Advances in neural …, 2021 - proceedings.neurips.cc
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …

Large-scale adversarial training for vision-and-language representation learning

Z Gan, YC Chen, L Li, C Zhu… - Advances in Neural …, 2020 - proceedings.neurips.cc
We present VILLA, the first known effort on large-scale adversarial training for vision-and-
language (V+L) representation learning. VILLA consists of two training stages: (i) task …

Think global, act local: Dual-scale graph transformer for vision-and-language navigation

S Chen, PL Guhur, M Tapaswi… - Proceedings of the …, 2022 - openaccess.thecvf.com
Following language instructions to navigate in unseen environments is a challenging
problem for autonomous embodied agents. The agent not only needs to ground languages …

VLN BERT: A recurrent vision-and-language BERT for navigation

Y Hong, Q Wu, Y Qi… - Proceedings of the …, 2021 - openaccess.thecvf.com
Accuracy of many visiolinguistic tasks has benefited significantly from the application of
vision-and-language (V&L) BERT. However, its application for the task of vision-and …