SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

HAAK Hammoud, H Itani, F Pizzati, P Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic
text-image pairs, significantly departing from previous methods relying on real data …
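For context, a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style training optimizes, whether the image-text pairs are real or, as in SynthCLIP, synthetic. The function name, toy batch, and temperature value are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of the CLIP symmetric contrastive loss (Radford et al., 2021).
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = similarity between image i and caption j.
    logits = image_emb @ text_emb.t() / temperature
    # Matching pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropies.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: random embeddings stand in for encoder outputs.
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```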

Noise-aware image captioning with progressively exploring mismatched words

Z Fu, K Song, L Zhou, Y Yang - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Image captioning aims to automatically generate captions for images by learning a cross-
modal generator from vision to language. The large number of image-text pairs required for …

Image captioning with multi-context synthetic data

F Ma, Y Zhou, F Rao, Y Zhang, X Sun - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Image captioning requires numerous annotated image-text pairs, resulting in substantial
annotation costs. Recently, large models (e.g., diffusion models and large language models) …

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

P Allgeuer, K Ahrens, S Wermter - arXiv preprint arXiv:2407.11211, 2024 - arxiv.org
We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that
uses an autoregressive transformer to generatively output classification labels as language …

Cross-modal Retrieval with Noisy Correspondence via Consistency Refining and Mining

X Ma, M Yang, Y Li, P Hu, J Lv… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The success of existing cross-modal retrieval (CMR) methods relies heavily on the assumption
that the annotated cross-modal correspondence is faultless. In practice, however, the …

Direct Metric Optimization for Image Captioning through Reward-Weighted Augmented Data Utilization

T Takada, Y Suzuki, H Takushima… - Proceedings of the …, 2024 - aclanthology.org
While image captioning is an essential field of vision-language models (VLMs), a lack of
continuity between the learning objective and final performance metrics of VLMs …

CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Q Cao, M Najibi, S Mehta - arXiv preprint arXiv:2410.11963, 2024 - arxiv.org
Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale
datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous …

NICE 2023 Zero-shot Image Captioning Challenge

T Kim, P Ahn, S Kim, S Lee, M Marsden, A Sala… - arXiv preprint arXiv …, 2023 - arxiv.org
In this report, we introduce the NICE project (https://nice.lgresearch.ai/) and share
the results and outcomes of the NICE 2023 challenge. This project is designed to challenge the …

LEMoN: Label Error Detection using Multimodal Neighbors

H Zhang, A Balagopalan, N Oufattole, H Jeong… - arXiv preprint arXiv …, 2024 - arxiv.org
Large repositories of image-caption pairs are essential for the development of vision-
language models. However, these datasets are often extracted from noisy data scraped from …

NICE: CVPR 2023 challenge on zero-shot image captioning

T Kim, P Ahn, S Kim, S Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation)
project and share the results and outcomes of the 2023 challenge. This project is designed to …