Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Scaling language-image pre-training via masking

Y Li, H Fan, R Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …
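
The efficiency gain described in this abstract comes from encoding only a random subset of image patches at each training step while keeping the usual image-text contrastive objective on top. The snippet below is a minimal illustration of that idea, not the authors' implementation; the helper names, the 50% keep ratio, and the temperature are assumptions.

```python
# Illustrative sketch (not FLIP's code): keep a random subset of patch
# tokens so the image encoder sees fewer tokens, then train with a
# CLIP-style symmetric contrastive loss. Names and ratios are assumptions.
import torch
import torch.nn.functional as F

def random_keep(patches: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a fraction of patch tokens per image.

    patches: (batch, num_patches, dim) token embeddings.
    Returns (batch, kept, dim); only these visible patches get encoded.
    """
    b, n, d = patches.shape
    kept = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :kept]
    return patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch of matched pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Encoding only the kept patches is what frees up compute for larger batches or fewer training steps at comparable quality.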

Flamingo: a visual language model for few-shot learning

JB Alayrac, J Donahue, P Luc… - Advances in neural …, 2022 - proceedings.neurips.cc
Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …

Conditional prompt learning for vision-language models

K Zhou, J Yang, CC Loy, Z Liu - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential
to investigate ways to adapt these models to downstream datasets. A recently proposed …
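
The adaptation strategy referenced here, prompt learning, replaces hand-written prompts such as "a photo of a {class}" with learnable context vectors; in the conditional variant, a small network makes those vectors depend on the input image. The sketch below is an illustrative rendering of that idea under assumed module names, dimensions, and meta-network design, not the published implementation.

```python
# Illustrative sketch of conditional prompt learning: learnable context
# tokens, shifted by an image-conditioned vector from a small meta-network.
# All names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class ConditionalPromptLearner(nn.Module):
    def __init__(self, n_ctx: int = 4, ctx_dim: int = 512, img_dim: int = 512):
        super().__init__()
        # Shared learnable context tokens, prepended to the class-name tokens.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Lightweight meta-network producing an image-conditioned shift.
        self.meta_net = nn.Sequential(
            nn.Linear(img_dim, img_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(img_dim // 4, ctx_dim),
        )

    def forward(self, image_features: torch.Tensor,
                class_token_embeds: torch.Tensor) -> torch.Tensor:
        """Build one prompt per (image, class) pair.

        image_features:     (batch, img_dim) from a frozen image encoder.
        class_token_embeds: (num_classes, n_cls_tokens, ctx_dim) embeddings
                            of the class-name tokens.
        Returns (batch, num_classes, n_ctx + n_cls_tokens, ctx_dim).
        """
        shift = self.meta_net(image_features)                # (batch, ctx_dim)
        ctx = self.ctx.unsqueeze(0) + shift.unsqueeze(1)     # (batch, n_ctx, ctx_dim)
        b, k = ctx.shape[0], class_token_embeds.shape[0]
        ctx = ctx.unsqueeze(1).expand(b, k, -1, -1)
        cls = class_token_embeds.unsqueeze(0).expand(b, -1, -1, -1)
        return torch.cat([ctx, cls], dim=2)
```

Only the context tokens and the meta-network would be trained; the resulting prompts are then fed through the frozen text encoder and compared with the image features.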

GroupViT: Semantic segmentation emerges from text supervision

J Xu, S De Mello, S Liu, W Byeon… - Proceedings of the …, 2022 - openaccess.thecvf.com
Grouping and recognition are important components of visual scene understanding, e.g., for
object detection and semantic segmentation. With end-to-end deep learning systems …

Detecting twenty-thousand classes using image-level supervision

X Zhou, R Girdhar, A Joulin, P Krähenbühl… - European Conference on …, 2022 - Springer
Current object detectors are limited in vocabulary size due to the small scale of detection
datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as …

FLAVA: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

RegionCLIP: Region-based language-image pretraining

Y Zhong, J Yang, P Zhang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved
impressive results on image classification in both zero-shot and transfer learning settings …
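
For context on the CLIP behavior this abstract refers to, zero-shot classification works by scoring an image embedding against text embeddings of prompted class names. A minimal sketch, assuming generic image_encoder/text_encoder callables rather than any specific CLIP release:

```python
# Minimal sketch of CLIP-style zero-shot classification. The encoders are
# assumed callables returning embeddings; this is not library code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image: torch.Tensor, class_names: list[str],
                       image_encoder, text_encoder) -> int:
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = F.normalize(text_encoder(prompts), dim=-1)               # (num_classes, dim)
    img = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # (1, dim)
    scores = (img @ txt.t()).squeeze(0)                            # cosine similarities
    return int(scores.argmax().item())                             # best-matching class
```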

Zero-shot text-guided object generation with dream fields

A Jain, B Mildenhall, JT Barron… - Proceedings of the …, 2022 - openaccess.thecvf.com
We combine neural rendering with multi-modal image and text representations to synthesize
diverse 3D objects solely from natural language descriptions. Our method, Dream Fields …
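
The combination described here can be pictured as an optimization loop: render a view of a learnable 3D scene, embed it with the image encoder, and increase its similarity to the embedding of the text prompt. The sketch below uses a deliberately toy stand-in for the scene and renderer (the paper optimizes a neural radiance field and scores views with CLIP); every name in it is illustrative.

```python
# Toy sketch of text-guided 3D optimization in the spirit of Dream Fields.
# The "scene", renderer, and encoders are stand-ins, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyScene(nn.Module):
    """Toy stand-in for a radiance field: one learnable image per camera pose."""
    def __init__(self, num_views: int = 8, res: int = 64):
        super().__init__()
        self.views = nn.Parameter(torch.randn(num_views, 3, res, res))

    def render(self, pose: int) -> torch.Tensor:
        return torch.sigmoid(self.views[pose])

def optimize_to_text(scene: ToyScene, image_encoder, text_embedding: torch.Tensor,
                     steps: int = 200, lr: float = 1e-2) -> None:
    """Push rendered views toward the text embedding of the prompt.

    image_encoder:  callable mapping (1, 3, res, res) images to (1, dim) embeddings.
    text_embedding: (1, dim) embedding of the text prompt.
    """
    opt = torch.optim.Adam(scene.parameters(), lr=lr)
    txt = F.normalize(text_embedding, dim=-1)
    for step in range(steps):
        pose = step % scene.views.shape[0]          # cycle through camera poses
        img = scene.render(pose).unsqueeze(0)
        emb = F.normalize(image_encoder(img), dim=-1)
        loss = -(emb * txt).sum()                   # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
```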