The landscape of publicly available vision foundation models (VFMs) such as CLIP and SAM is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their …
Contrastive pre-training of image-text foundation models such as CLIP demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream …
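For reference, a minimal sketch of the symmetric image-text contrastive objective that CLIP-style pre-training uses; the tensor shapes, temperature value, and the random embeddings standing in for encoder outputs are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective:
# embeddings are L2-normalized, a temperature-scaled similarity matrix is built,
# and cross-entropy is applied in both the image->text and text->image directions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) outputs of separate image/text encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)              # image -> its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)          # caption -> its image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```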
Recent breakthroughs in vision-language models (VLMs) open a new chapter for the vision community. VLMs provide stronger and more generalizable feature embeddings …
Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting …
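A hedged sketch of the joint-selection idea (an assumed setup, not the paper's algorithm): candidate sub-batches are drawn from a larger super-batch and scored as a whole by a learnability-style criterion, learner loss minus reference-model loss, and the highest-scoring sub-batch is used for the update. The model names, sizes, and cross-entropy scoring below are placeholders.

```python
# Sketch of joint batch selection by a "learnability" score over whole sub-batches.
import torch

def batch_loss(model, x, y):
    # Mean cross-entropy of an entire candidate sub-batch.
    return torch.nn.functional.cross_entropy(model(x), y)

def select_batch(learner, reference, super_x, super_y, batch_size=32, n_candidates=16):
    best_idx, best_score = None, -float("inf")
    with torch.no_grad():
        for _ in range(n_candidates):
            idx = torch.randperm(super_x.size(0))[:batch_size]
            # Joint score of the sub-batch: high learner loss (still informative)
            # minus reference loss (down-weights noisy or unlearnable examples).
            score = batch_loss(learner, super_x[idx], super_y[idx]) \
                  - batch_loss(reference, super_x[idx], super_y[idx])
            if score > best_score:
                best_idx, best_score = idx, score.item()
    return super_x[best_idx], super_y[best_idx]

# Toy usage with linear models standing in for the learner and the reference model.
learner, reference = torch.nn.Linear(16, 4), torch.nn.Linear(16, 4)
xs, ys = torch.randn(512, 16), torch.randint(0, 4, (512,))
bx, by = select_batch(learner, reference, xs, ys)
print(bx.shape, by.shape)
```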
This paper enhances image-GPT (iGPT), one of the pioneering works that introduced autoregressive next-pixel prediction as a pretraining objective for visual representation learning. Two …
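A rough sketch of the autoregressive next-pixel objective (illustrative architecture sizes, not iGPT's implementation): the image is flattened into a 1-D sequence of discrete pixel values and a causally masked Transformer predicts each pixel from the ones before it.

```python
# Tiny causal Transformer over a flattened pixel sequence (illustrative sizes).
import torch
import torch.nn as nn

class NextPixelModel(nn.Module):
    def __init__(self, vocab=256, dim=128, n_heads=4, n_layers=2, seq_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, pixels):  # pixels: (batch, seq) integer values in [0, 255]
        seq = pixels.size(1)
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.embed(pixels) + self.pos[:, :seq]
        h = self.blocks(h, mask=causal)
        return self.head(h)     # logits over the next pixel value at each position

# Shift-by-one training: position t predicts pixel t+1.
model = NextPixelModel()
pixels = torch.randint(0, 256, (4, 64))          # four tiny flattened "images"
logits = model(pixels[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), pixels[:, 1:].reshape(-1))
print(loss.item())
```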
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions which tends to be noisy in web-crawled data. We …
X Li, Z Wang, C Xie - arXiv preprint arXiv:2306.15658, 2023 - arxiv.org
The recent work CLIPA presents an inverse scaling law for CLIP training--whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can …
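Under this inverse-scaling-law framing, larger encoders can be trained on fewer input tokens per example. A hedged illustration (assumed helper names, not CLIPA's code) of two simple ways to shorten the sequences: subsampling image patch tokens and truncating text tokens.

```python
# Illustrative token-shortening helpers applied before the encoders.
import torch

def shorten_image_tokens(image_tokens, keep_len):
    """image_tokens: (batch, n_tokens, dim) patch embeddings; keep a random subset."""
    n = image_tokens.size(1)
    keep = torch.randperm(n)[:keep_len].sort().values   # preserve original patch order
    return image_tokens[:, keep, :]

def shorten_text_tokens(text_ids, keep_len):
    """text_ids: (batch, n_tokens) token ids; simple truncation to keep_len."""
    return text_ids[:, :keep_len]

patches = torch.randn(8, 196, 768)        # e.g. 14x14 ViT patch embeddings
ids = torch.randint(0, 49408, (8, 77))    # CLIP-style text token ids
print(shorten_image_tokens(patches, 64).shape, shorten_text_tokens(ids, 16).shape)
```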
While deep learning has led to huge progress in complex image classification tasks like ImageNet, unexpected failure modes, e.g., via spurious features, call into question how reliably …
Z Wang, X Li, H Zhu, C Xie - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
The machine learning community has witnessed a drastic change in the training pipeline, pivoted by those "foundation models" with unprecedented scales. However, the field of …