Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data

N Popp, JH Metzen, M Hein - arXiv preprint arXiv:2404.16637, 2024 - arxiv.org
Multi-modal foundation models such as CLIP have showcased impressive zero-shot
capabilities. However, their applicability in resource-constrained environments is limited due …

Technical Report of NICE Challenge at CVPR 2024: Caption Re-ranking Evaluation Using Ensembled CLIP and Consensus Scores

K Jeong, W Lee, W Nam, M Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
This report presents the ECO (Ensembled Clip score and cOnsensus score) pipeline from
team DSBA LAB, a new framework used to evaluate and rank captions for a given …
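The snippet only names the two scoring components, so the following is a hypothetical sketch of the general recipe (rank candidate captions for an image by combining a CLIP image-text score with a consensus score, here taken as mean similarity to the other candidates), not the ECO pipeline itself; the weighting `alpha` and the use of precomputed, L2-normalised embeddings are assumptions.

```python
import torch

def rank_captions(image_emb: torch.Tensor,      # (d,)  L2-normalised CLIP image embedding
                  caption_embs: torch.Tensor,   # (n, d) L2-normalised CLIP text embeddings
                  alpha: float = 0.5):          # assumed weight between the two scores
    clip_scores = caption_embs @ image_emb                         # image-text cosine similarity
    pairwise = caption_embs @ caption_embs.T                       # caption-caption similarities
    n = caption_embs.shape[0]
    consensus = (pairwise.sum(dim=1) - pairwise.diag()) / (n - 1)  # mean similarity to the other captions
    combined = alpha * clip_scores + (1 - alpha) * consensus
    return torch.argsort(combined, descending=True)                # indices, best caption first
```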

The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better

S Geng, CY Hsieh, V Ramanujan, M Wallingford… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative text-to-image models enable us to synthesize unlimited amounts of images in a
controllable manner, spurring many recent efforts to train vision models with synthetic data …

Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

Y Oh, P Ahn, J Kim, G Song, S Lee, IS Kweon… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot
recognition abilities, yet face challenges in visio-linguistic compositionality, particularly in …

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

M Fang, S Ji, J Zuo, H Huang, Y Xia, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a
sequence-to-sequence model to directly generate candidate identifiers based on natural …

Tuning-free Universally-Supervised Semantic Segmentation

X Yang, X Gong - arXiv preprint arXiv:2405.14294, 2024 - arxiv.org
This work presents a tuning-free semantic segmentation framework based on classifying
SAM masks by CLIP, which is universally applicable to various types of supervision. Initially …
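The abstract names the core step, classifying class-agnostic SAM masks with CLIP. Below is a minimal illustrative sketch of that step under the assumption that CLIP embeddings of the masked crops and of per-class text prompts have already been computed; the function name, inputs, and temperature are placeholders, not the paper's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def label_masks(mask_crop_embs: torch.Tensor,   # (m, d) CLIP embeddings of masked image crops
                class_text_embs: torch.Tensor,  # (c, d) CLIP embeddings of "a photo of a {class}"
                temperature: float = 0.01):
    mask_crop_embs = F.normalize(mask_crop_embs, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    logits = mask_crop_embs @ class_text_embs.T / temperature   # (m, c) similarity logits
    return logits.argmax(dim=-1)                                # one class index per SAM mask
```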

CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

PKA Vasu, H Pouransari, F Faghri, O Tuzel - arXiv preprint arXiv …, 2024 - arxiv.org
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But
recent studies have shown that learnt representations in CLIP are not well suited for dense …
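For context on the zero-shot classification setting this entry refers to, here is the standard CLIP zero-shot recipe sketched with the Hugging Face transformers API; the checkpoint, image path, and label prompts are illustrative, and this is not the pretraining method proposed in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                         # placeholder input image
labels = ["a photo of a cat", "a photo of a dog"]         # prompt-engineered class names
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image             # (1, num_labels) similarity logits
print(labels[logits.softmax(dim=-1).argmax().item()])     # predicted label
```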

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

YG Hsieh, CY Hsieh, SY Yeh, L Béthune… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans describe complex scenes with compositionality, using simple text descriptions
enriched with links and relationships. While vision-language research has aimed to develop …

Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking

T Zhu, MC Jung, J Clark - arXiv preprint arXiv:2404.08535, 2024 - arxiv.org
Contrastive learning has gained widespread adoption for retrieval tasks due to its minimal
requirement for manual annotations. However, popular contrastive frameworks typically …
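To make "popular contrastive frameworks" concrete, the sketch below shows the symmetric InfoNCE (CLIP-style) loss over a batch of paired embeddings; it is a generic baseline, not the generalized loss this paper proposes, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_embs: torch.Tensor,   # (b, d) image embeddings
                          txt_embs: torch.Tensor,   # (b, d) paired text embeddings
                          temperature: float = 0.07) -> torch.Tensor:
    img_embs = F.normalize(img_embs, dim=-1)
    txt_embs = F.normalize(txt_embs, dim=-1)
    logits = img_embs @ txt_embs.T / temperature     # (b, b) similarity matrix
    targets = torch.arange(len(logits))              # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +       # image-to-text and text-to-image terms
            F.cross_entropy(logits.T, targets)) / 2
```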

Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

R Vemulapalli, H Pouransari, F Faghri, S Mehta… - Forty-first International … - openreview.net
Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive
performance on various downstream tasks, especially with limited labeled target data …