Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Foundational models defining a new era in vision: A survey and outlook

M Awais, M Naseer, S Khan, RM Anwer… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Segment everything everywhere all at once

X Zou, J Yang, H Zhang, F Li, L Li… - Advances in …, 2024 - proceedings.neurips.cc
In this work, we present SEEM, a promptable and interactive model for segmenting
everything everywhere all at once in an image. In SEEM, we propose a novel and versatile …

Sigmoid loss for language image pre-training

X Zhai, B Mustafa, A Kolesnikov… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
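The snippet above describes replacing the softmax-normalized contrastive objective with a pairwise sigmoid applied to every image-text pair. Below is a minimal PyTorch sketch of that idea, assuming L2-normalized embeddings and learnable log-temperature and bias scalars; the function name and signature are illustrative, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb, txt_emb, log_temp, bias):
    """Pairwise sigmoid loss over a batch of image-text embeddings.

    img_emb, txt_emb: [n, d] tensors, assumed L2-normalized.
    log_temp, bias: learnable scalars (log-temperature and bias) -- illustrative names.
    """
    # [n, n] similarity logits for all image-text pairs
    logits = img_emb @ txt_emb.t() * log_temp.exp() + bias
    n = logits.size(0)
    # +1 on the diagonal (matched pairs), -1 off the diagonal (mismatched pairs)
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0
    # each pair is an independent binary problem; no softmax normalization over the batch
    return -F.logsigmoid(labels * logits).sum() / n
```

Because no normalization couples the pairs, the loss decomposes per pair, which is what lets it operate without a global softmax over the batch.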

Generalized decoding for pixel, image, and language

X Zou, ZY Dou, J Yang, Z Gan, L Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present X-Decoder, a generalized decoding model that can predict pixel-level
segmentation and language tokens seamlessly. X-Decoder takes as input two types of …

GLIPv2: Unifying localization and vision-language understanding

H Zhang, P Zhang, X Hu, YC Chen… - Advances in …, 2022 - proceedings.neurips.cc
We present GLIPv2, a grounded VL understanding model that serves both localization tasks
(e.g., object detection, instance segmentation) and Vision-Language (VL) understanding …

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network
(DNN) training, and they usually train a DNN for each single visual recognition task …

A simple framework for open-vocabulary segmentation and detection

H Zhang, F Li, X Zou, S Liu, C Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we present OpenSeeD, a simple Open-vocabulary Segmentation and Detection
framework that learns from different segmentation and detection datasets. To bridge the gap …

TEMOS: Generating Diverse Human Motions from Textual Descriptions

M Petrovich, MJ Black, G Varol - European Conference on Computer …, 2022 - Springer
We address the problem of generating diverse 3D human motions from textual descriptions.
This challenging task requires joint modeling of both modalities: understanding and …