ShareGPT4V: Improving large multi-modal models with better captions

L Chen, J Li, X Dong, P Zhang, C He, J Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet
often constrained by the scarcity of high-quality image-text data. To address this bottleneck …

CapsFusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate a remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

MobileCLIP: Fast image-text models through multi-modal reinforced training

PKA Vasu, H Pouransari, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive pre-training of image-text foundation models such as CLIP has demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …

Distilling vision-language models on millions of videos

Y Zhao, L Zhao, X Zhou, J Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models, but there …

Prompt-specific poisoning attacks on text-to-image generative models

S Shan, W Ding, J Passananti, H Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Data poisoning attacks manipulate training data to introduce unexpected behaviors into
machine learning models at training time. For text-to-image generative models with massive …

CommonCanvas: Open diffusion models trained on Creative-Commons images

A Gokaslan, AF Cooper, J Collins… - Proceedings of the …, 2024 - openaccess.thecvf.com
We train a set of open text-to-image (T2I) diffusion models on a dataset of curated Creative-
Commons-licensed (CC) images, which yields models that are competitive with Stable …

VideoCon: Robust video-language alignment via contrast captions

H Bansal, Y Bitton, I Szpektor… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite being (pre-)trained on a massive amount of data, state-of-the-art video-language
alignment models are not robust to semantically plausible contrastive changes in the video …

Sieve: Multimodal dataset pruning using image captioning models

A Mahmoud, M Elhoushi, A Abbas… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-
crawled datasets. This underscores the critical need for dataset pruning, as the quality of …

From scarcity to efficiency: Improving clip training via visual-enriched captions

Z Lai, H Zhang, W Wu, H Bai, A Timofeev, X Du… - arXiv preprint arXiv …, 2023 - arxiv.org
Web-crawled datasets are pivotal to the success of pre-training vision-language models,
exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant …

Towards open-ended visual recognition with large language model

Q Yu, X Shen, LC Chen - arXiv preprint arXiv:2311.08400, 2023 - arxiv.org
Localizing and recognizing objects in the open-ended physical world poses a long-standing
challenge within the domain of machine perception. Recent methods have endeavored to …