VeCLIP: Improving CLIP training via visual-enriched captions

Z Lai, H Zhang, B Zhang, W Wu, H Bai… - … on Computer Vision, 2025 - Springer
Large-scale web-crawled datasets are fundamental for the success of pre-training vision-
language models, such as CLIP. However, the inherent noise and potential irrelevance of …

Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

S Goyal, P Maini, ZC Lipton… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully
selected subsets of massive web scrapes. For instance, the LAION public dataset retained …

No" zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance

V Udandarao, A Prabhu, A Ghosh… - The Thirty-eighth …, 2024 - openreview.net
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation
performance of multimodal models, such as CLIP for classification and Stable-Diffusion for …

Sieve: Multimodal dataset pruning using image captioning models

A Mahmoud, M Elhoushi, A Abbas… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-
crawled datasets. This underscores the critical need for dataset pruning as the quality of …

From scarcity to efficiency: Improving CLIP training via visual-enriched captions

Z Lai, H Zhang, W Wu, H Bai, A Timofeev, X Du, Z Gan… - 2023 - openreview.net
Web-crawled datasets are pivotal to the success of pre-training vision-language models,
exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant …

Rephrasing the web: A recipe for compute and data-efficient language modeling

P Maini, S Seto, H Bai, D Grangier, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are trained on massive scrapes of the web, which are often
unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such …

Parrot captions teach CLIP to spot text

Y Lin, C He, AJ Wang, B Wang, W Li… - European Conference on …, 2025 - Springer
Despite CLIP [29] being the foundation model in numerous vision-language applications,
CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to 'Parrot' the …

HYPE: Hyperbolic entailment filtering for underspecified images and texts

W Kim, S Chun, T Kim, D Han, S Yun - European Conference on Computer …, 2025 - Springer
In an era where the volume of data drives the effectiveness of self-supervised learning, the
specificity and clarity of data semantics play a crucial role in model training. Addressing this …

An introduction to vision-language modeling

F Bordes, RY Pang, A Ajay, AC Li, A Bardes… - arXiv preprint arXiv …, 2024 - arxiv.org
Following the recent popularity of Large Language Models (LLMs), several attempts have
been made to extend them to the visual domain. From having a visual assistant that could …

The devil is in the details: A deep dive into the rabbit hole of data filtering

H Yu, Y Tian, S Kumar, L Yang, H Wang - arXiv preprint arXiv:2309.15954, 2023 - arxiv.org
The quality of pre-training data plays a critical role in the performance of foundation models.
Popular foundation models often design their own recipe for data filtering, which makes it …