A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model, BEiT-3, which achieves …
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …
Universal image segmentation is not a new concept. Past attempts to unify image segmentation include scene parsing, panoptic segmentation, and, more recently, new …
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike recent visual transformers that introduce vision-specific inductive biases into their …
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by …
J Yang, C Li, X Dai, J Gao - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We propose focal modulation networks (FocalNets for short), where self-attention (SA) is completely replaced by a focal modulation module for modeling token interactions in vision …
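As a rough illustration of the idea in this snippet (context aggregated per token and then multiplied elementwise into the query, rather than a pairwise query-key attention matrix), here is a minimal toy sketch. It is not the paper's implementation: real FocalNets use learned projections, depthwise convolutions over multiple focal levels, and learned gates, all of which are replaced below by fixed toy choices (a window mean, a global mean, and equal 0.5 gates).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def focal_modulation_1d(tokens, window=1):
    """Toy 1-D sketch of modulation-style token mixing.

    Each token's "query" is modulated elementwise by a gated summary of
    its context: a local window average (a stand-in for a fine focal
    level) plus a global average (a stand-in for a coarse focal level).
    No token-to-token attention matrix is ever formed.
    """
    n = len(tokens)
    d = len(tokens[0])
    # coarse context: per-channel mean over all tokens
    global_ctx = [sum(t[c] for t in tokens) / n for c in range(d)]
    out = []
    for i, q in enumerate(tokens):
        # fine context: per-channel mean over a small window around token i
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local_ctx = [
            sum(tokens[j][c] for j in range(lo, hi)) / (hi - lo)
            for c in range(d)
        ]
        # gated aggregation of the two context levels (fixed 0.5/0.5 gates
        # here; FocalNets learn these gates per token)
        modulator = [0.5 * local_ctx[c] + 0.5 * global_ctx[c] for c in range(d)]
        # elementwise modulation of the query
        out.append([q[c] * sigmoid(modulator[c]) for c in range(d)])
    return out
```

Because the modulator depends only on pooled context, the cost per token is linear in the sequence length, which is the efficiency argument the snippet gestures at versus quadratic self-attention.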
J Li, J Jain, H Shi - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
In this paper we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image with flexible and …
Text-to-image (T2I) research has grown explosively in the past year owing to the large-scale pre-trained diffusion models and many emerging personalization and editing …
Vision transformers have recently achieved competitive results across various vision tasks but still suffer from heavy computation costs when processing a large number of tokens …