DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

Self-supervised learning from images with a joint-embedding predictive architecture

M Assran, Q Duval, I Misra… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper demonstrates an approach for learning highly semantic image representations
without relying on hand-crafted data-augmentations. We introduce the Image-based Joint …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

MAGE: Masked generative encoder to unify representation learning and image synthesis

T Li, H Chang, S Mishra, H Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Generative modeling and representation learning are two key tasks in computer vision.
However, models for these tasks are typically trained independently, which ignores the potential for …

Revealing the dark secrets of masked image modeling

Z Xie, Z Geng, J Hu, Z Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision
downstream tasks, but how and where MIM works remain unclear. In this paper, we compare …

Learning vision from models rivals learning vision from data

Y Tian, L Fan, K Chen, D Katabi… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce SynCLR, a novel approach for learning visual representations exclusively from
synthetic images without any real data. We synthesize a large dataset of image captions …

The effectiveness of MAE pre-pretraining for billion-scale pretraining

M Singh, Q Duval, KV Alwala, H Fan… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for
visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using …

Contrastive masked autoencoders are stronger vision learners

Z Huang, X Jin, C Lu, Q Hou, MM Cheng… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Masked image modeling (MIM) has achieved promising results on various vision tasks.
However, the limited discriminability of the learned representations shows that there is still plenty …

What to hide from your students: Attention-guided masked image modeling

I Kakogeorgiou, S Gidaris, B Psomas, Y Avrithis… - … on Computer Vision, 2022 - Springer
Transformers and masked language modeling are quickly being adopted and explored in
computer vision as vision transformers and masked image modeling (MIM). In this work, we …

No representation rules them all in category discovery

S Vaze, A Vedaldi, A Zisserman - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically,
given a dataset with labelled and unlabelled images, the task is to cluster all images in the …