Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

S Liu, Z Zeng, T Ren, F Li, H Zhang, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we present an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional CLIP

Q Yu, J He, X Deng, X Shen… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing
objects from an open set of categories in diverse environments. One way to address this …

GLIGEN: Open-set grounded text-to-image generation

Y Li, H Liu, Q Wu, F Mu, J Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale text-to-image diffusion models have made amazing advances. However, the
status quo is to use text input alone, which can impede controllability. In this work, we …

Simple open-vocabulary object detection

M Minderer, A Gritsenko, A Stone, M Neumann… - … on Computer Vision, 2022 - Springer
Combining simple architectures with large-scale pre-training has led to massive
improvements in image classification. For object detection, pre-training and scaling …

GLIPv2: Unifying localization and vision-language understanding

H Zhang, P Zhang, X Hu, YC Chen… - Advances in …, 2022 - proceedings.neurips.cc
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks
(e.g., object detection, instance segmentation) and Vision-Language (VL) understanding …

A simple framework for open-vocabulary segmentation and detection

H Zhang, F Li, X Zou, S Liu, C Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we present OpenSeeD, a simple Open-vocabulary Segmentation and Detection
framework that learns from different segmentation and detection datasets. To bridge the gap …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

Detecting twenty-thousand classes using image-level supervision

X Zhou, R Girdhar, A Joulin, P Krähenbühl… - European Conference on …, 2022 - Springer
Current object detectors are limited in vocabulary size due to the small scale of detection
datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as …