Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Self-supervised representation learning: Introduction, advances, and challenges

L Ericsson, H Gouk, CC Loy… - IEEE Signal Processing …, 2022 - ieeexplore.ieee.org
Self-supervised representation learning (SSRL) methods aim to provide powerful, deep
feature learning without the requirement of large annotated data sets, thus alleviating the …

ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …

VideoMAE V2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

Flamingo: a visual language model for few-shot learning

JB Alayrac, J Donahue, P Luc… - Advances in neural …, 2022 - proceedings.neurips.cc
Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …

MineDojo: Building open-ended embodied agents with internet-scale knowledge

L Fan, G Wang, Y Jiang, A Mandlekar… - Advances in …, 2022 - proceedings.neurips.cc
Autonomous agents have made great strides in specialist domains like Atari games and Go.
However, they typically learn tabula rasa in isolated environments with limited and manually …

VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc
Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …

Florence: A new foundation model for computer vision

L Yuan, D Chen, YL Chen, N Codella, X Dai… - arXiv preprint arXiv …, 2021 - arxiv.org
Automated visual understanding of our diverse and open world demands computer vision
models to generalize well with minimal customization for specific tasks, similar to human …

PandaGPT: One model to instruction-follow them all

Y Su, T Lan, H Li, J Xu, Y Wang, D Cai - arXiv preprint arXiv:2305.16355, 2023 - arxiv.org
We present PandaGPT, an approach to emPower large lANguage moDels with visual and
Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can …

OmniVL: One foundation model for image-language and video-language tasks

J Wang, D Chen, Z Wu, C Luo, L Zhou… - Advances in neural …, 2022 - proceedings.neurips.cc
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …