Self-supervised learning from images with a joint-embedding predictive architecture

Authors

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas

Publication date

2023

Conference

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Pages

15619-15629

Description

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
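The masking strategy described above (several large target blocks plus one spatially distributed context block) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_block` and `sample_masks`, the 16x16 patch grid, the number of target blocks, and the scale and aspect-ratio ranges are all illustrative assumptions that do not appear in the abstract. The sketch only shows block sampling on a patch grid, with target patches removed from the context so the prediction task is non-trivial.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample a rectangular block of patch indices on a grid x grid layout."""
    scale = rng.uniform(*scale_range)      # fraction of all patches to cover
    aspect = rng.uniform(*aspect_range)    # height / width ratio
    area = scale * grid * grid
    # Derive integer height and width from area and aspect ratio, clamped to the grid.
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}


def sample_masks(grid=16, num_targets=4, seed=0):
    """Return (context, targets) as sets of patch indices.

    All numeric ranges below are assumed values chosen for illustration.
    """
    rng = random.Random(seed)
    # (a) target blocks with a sufficiently large scale
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    # (b) one large, spatially distributed context block
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Remove target patches from the context so targets must be predicted.
    for t in targets:
        context -= t
    return context, targets


if __name__ == "__main__":
    context, targets = sample_masks()
    print("context patches:", len(context))
    print("target block sizes:", [len(t) for t in targets])
```

In a full training setup, a context encoder would see only the `context` patches, and a predictor would be trained to output the representation of each block in `targets`; those components are outside the scope of this sketch.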

Total citations

Cited by 148

Citations per year: 2023: 55; 2024: 92
