Semantic image segmentation: Two decades of research

G Csurka, R Volpi, B Chidlovskii - Foundations and Trends® …, 2022 - nowpublishers.com
Semantic image segmentation (SiS) plays a fundamental role in a broad variety of computer
vision applications, providing key information for the global understanding of an image. This …

Image as a foreign language: BEiT pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023 - openaccess.thecvf.com
A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

EVA: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

OneFormer: One transformer to rule universal image segmentation

J Jain, J Li, MT Chiu, A Hassani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Universal Image Segmentation is not a new concept. Past attempts to unify image
segmentation include scene parsing, panoptic segmentation, and, more recently, new …

Vision transformer adapter for dense predictions

Z Chen, Y Duan, W Wang, J He, T Lu, J Dai… - arXiv preprint arXiv …, 2022 - arxiv.org
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike
recent visual transformers that introduce vision-specific inductive biases into their …

Mask DINO: Towards a unified transformer-based framework for object detection and segmentation

F Li, H Zhang, H Xu, S Liu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper we present Mask DINO, a unified object detection and segmentation
framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by …

Focal modulation networks

J Yang, C Li, X Dai, J Gao - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is
completely replaced by a focal modulation module for modeling token interactions in vision …

Matting Anything

J Li, J Jain, H Shi - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile
framework for estimating the alpha matte of any instance in an image with flexible and …

Prompt-free diffusion: Taking" text" out of text-to-image diffusion models

X Xu, J Guo, Z Wang, G Huang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Text-to-image (T2I) research has grown explosively in the past year owing to the
large-scale pre-trained diffusion models and many emerging personalization and editing …

Expediting large-scale vision transformer for dense prediction without fine-tuning

W Liang, Y Yuan, H Ding, X Luo… - Advances in …, 2022 - proceedings.neurips.cc
Vision transformers have recently achieved competitive results across various vision tasks
but still suffer from heavy computation costs when processing a large number of tokens …