Vision-language models (VLMs) are trained for thousands of GPU hours on carefully selected subsets of massive web scrapes. For instance, the LAION public dataset retained …
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification and Stable-Diffusion for …
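To make the "zero-shot" setting concrete, the sketch below classifies an image with a pretrained CLIP checkpoint by scoring it against text prompts, with no task-specific fine-tuning. The checkpoint name, label set, and image path are placeholder assumptions for illustration, not details taken from the works excerpted here.

```python
# Minimal zero-shot image classification with a pretrained CLIP checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]                      # hypothetical label set
prompts = [f"a photo of a {c}" for c in labels]
image = Image.open("example.jpg").convert("RGB")    # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image       # image-text similarity scores
print(labels[logits.argmax(dim=-1).item()])         # predicted label, no fine-tuning
```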
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of …
Web-crawled datasets are pivotal to the success of pre-training vision-language models, exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant …
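A common way to prune such noisy pairs is to keep only image/AltText pairs whose CLIP image-text similarity exceeds a cutoff. The sketch below is one minimal version of that idea; the checkpoint, threshold value, and file names are assumptions for illustration, not the filtering recipe of any specific dataset mentioned above.

```python
# Sketch of CLIP-score filtering for web-crawled image/AltText pairs:
# keep a pair only if the cosine similarity between its image and text
# embeddings exceeds a threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, alt_text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[alt_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

THRESHOLD = 0.28   # illustrative cutoff, not taken from any specific pipeline
pairs = [("img_000.jpg", "red running shoes on a white background"),
         ("img_001.jpg", "click here for the best deals!!!")]
kept = [(path, text) for path, text in pairs if clip_score(path, text) >= THRESHOLD]
```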
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such …
Although CLIP [29] is the foundation model in numerous vision-language applications, it suffers from a severe text spotting bias. This bias causes CLIP models to 'Parrot' the …
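A simple way to probe this kind of text-spotting bias is to overlay an unrelated word on an image and observe how CLIP's image-text scores shift. The sketch below is a hypothetical probe in that spirit (the checkpoint, image path, and prompts are assumptions), not the methodology of the cited work.

```python
# Hypothetical probe for text-spotting ("parroting") bias: compare CLIP's
# label probabilities for an image before and after overlaying an unrelated word.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg").convert("RGB")                 # placeholder image path
spoofed = image.copy()
ImageDraw.Draw(spoofed).text((10, 10), "dog", fill="white")  # overlay the word "dog"

prompts = ["a photo of a cat", "a photo of a dog"]
for name, img in [("original", image), ("with text overlay", spoofed)]:
    inputs = processor(text=prompts, images=img, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    print(name, {p: round(float(s), 3) for p, s in zip(prompts, probs)})
```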
In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this …
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could …
The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models are often built with their own data-filtering recipe, which makes it …