Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …
Contrastive pre-training of image-text foundation models such as CLIP demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream …
The recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there …
Data poisoning attacks manipulate training data to introduce unexpected behaviors into machine learning models at training time. For text-to-image generative models with massive …
We train a set of open text-to-image (T2I) diffusion models on a dataset of curated Creative-Commons-licensed (CC) images, which yields models that are competitive with Stable …
Despite being (pre-)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in the video …
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of …
Web-crawled datasets are pivotal to the success of pre-training vision-language models, exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant …
Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception. Recent methods have endeavored to …