Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Vlp: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Datacomp: In search of the next generation of multimodal datasets

SY Gadre, G Ilharco, A Fang… - Advances in …, 2024 - proceedings.neurips.cc
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable
Diffusion and GPT-4, yet their design does not receive the same research attention as model …

Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action

D Shah, B Osiński, S Levine - Conference on robot …, 2023 - proceedings.mlr.press
Goal-conditioned policies for robotic navigation can be trained on large, unannotated
datasets, providing for good generalization to real-world settings. However, particularly in …

Socratic models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint arXiv …, 2022 - arxiv.org
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …

Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models

Z Lin, S Yu, Z Kuang, D Pathak… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is
a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot …

Pic2word: Mapping pictures to words for zero-shot composed image retrieval

K Saito, K Sohn, X Zhang, CL Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of CIR models …

Towards language models that can see: Computer vision through the lens of natural language

W Berrios, G Mittal, T Thrush, D Kiela… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose LENS, a modular approach for tackling computer vision problems by leveraging
the power of large language models (LLMs). Our system uses a language model to reason …

Semantic abstraction: Open-world 3d scene understanding from 2d vision-language models

H Ha, S Song - arXiv preprint arXiv:2207.11514, 2022 - arxiv.org
We study open-world 3D scene understanding, a family of tasks that require agents to
reason about their 3D environment with an open-set vocabulary and out-of-domain visual …

Weakly supervised 3d open-vocabulary segmentation

K Liu, F Zhan, J Zhang, M Xu, Y Yu… - Advances in …, 2023 - proceedings.neurips.cc
Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception
and thus a crucial objective in computer vision research. However, this task is heavily …