CapsFusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

VisualGPT: Data-efficient adaptation of pretrained language models for image captioning

J Chen, H Guo, K Yi, B Li… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The limited availability of annotated data often hinders real-world applications of machine
learning. To efficiently learn from small quantities of multimodal data, we leverage the …

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

S Gu, C Clark, A Kembhavi - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Many high-level skills that are required for computer vision tasks, such as parsing questions,
comparing and contrasting semantics, and writing descriptions, are also required in other …

Improving multimodal datasets with image captioning

T Nguyen, SY Gadre, G Ilharco… - Advances in Neural …, 2024 - proceedings.neurips.cc
Massive web datasets play a key role in the success of large vision-language models like
CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to …

A picture is worth more than 77 text tokens: Evaluating CLIP-style models on dense captions

J Urbanek, F Bordes, P Astolfi… - Proceedings of the …, 2024 - openaccess.thecvf.com
Curation methods for massive vision-language datasets trade off between dataset size and
quality. However, even the highest-quality available curated captions are far too short to …

CLAIR: Evaluating image captions with large language models

D Chan, S Petryk, JE Gonzalez, T Darrell… - arXiv preprint arXiv …, 2023 - arxiv.org
The evaluation of machine-generated image captions poses an interesting yet persistent
challenge. Effective evaluation measures must consider numerous dimensions of similarity …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

Robust cross-modal representation learning with progressive self-distillation

A Andonian, S Chen, R Hamid - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The learning objective of CLIP's vision-language approach does not effectively account for
the noisy many-to-many correspondences found in web-harvested image captioning …

From scarcity to efficiency: Improving CLIP training via visual-enriched captions

Z Lai, H Zhang, W Wu, H Bai, A Timofeev, X Du… - arXiv preprint arXiv …, 2023 - arxiv.org
Web-crawled datasets are pivotal to the success of pre-training vision-language models,
exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant …

SmallCap: Lightweight image captioning prompted with retrieval augmentation

R Ramos, B Martins, D Elliott… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in image captioning have focused on scaling the data and model size,
substantially increasing the cost of pre-training and finetuning. As an alternative to large …