Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …
Contrastive pre-training of image-text foundation models such as CLIP demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream …
The recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there …
Data poisoning attacks manipulate training data to introduce unexpected behaviors into machine learning models at training time. For text-to-image generative models with massive …
We train a set of open text-to-image (T2I) diffusion models on a dataset of curated Creative-Commons-licensed (CC) images, which yields models that are competitive with Stable …
Despite being (pre-)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in the video …
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of …
Web-crawled datasets are pivotal to the success of pre-training vision-language models, exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant …
Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception. Recent methods have endeavored to …