SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

HAAK Hammoud, H Itani, F Pizzati, P Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic
text-image pairs, significantly departing from previous methods relying on real data …
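For context, a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style training optimizes, whether the image-text pairs are real or, as in SynthCLIP, synthetic. The function name, toy batch, and temperature value are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of the CLIP symmetric contrastive loss (Radford et al., 2021).
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = similarity between image i and caption j.
    logits = image_emb @ text_emb.t() / temperature
    # Matching pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropies.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: random embeddings stand in for encoder outputs.
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```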

Noise-aware image captioning with progressively exploring mismatched words

Z Fu, K Song, L Zhou, Y Yang - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Image captioning aims to automatically generate captions for images by learning a cross-
modal generator from vision to language. The large number of image-text pairs required for …

Image captioning with multi-context synthetic data

F Ma, Y Zhou, F Rao, Y Zhang, X Sun - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Image captioning requires numerous annotated image-text pairs, resulting in substantial
annotation costs. Recently, large models (e.g., diffusion models and large language models) …

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

P Allgeuer, K Ahrens, S Wermter - arXiv preprint arXiv:2407.11211, 2024 - arxiv.org
We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that
uses an autoregressive transformer to generatively output classification labels as language …

Cross-modal Retrieval with Noisy Correspondence via Consistency Refining and Mining

X Ma, M Yang, Y Li, P Hu, J Lv… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The success of existing cross-modal retrieval (CMR) methods relies heavily on the assumption
that the annotated cross-modal correspondence is faultless. In practice, however, the …

Direct Metric Optimization for Image Captioning through Reward-Weighted Augmented Data Utilization

T Takada, Y Suzuki, H Takushima… - Proceedings of the …, 2024 - aclanthology.org
While image captioning is an essential field of vision-language models (VLMs), a lack of
continuity between the learning objective and final performance metrics of VLMs …

CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Q Cao, M Najibi, S Mehta - arXiv preprint arXiv:2410.11963, 2024 - arxiv.org
Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale
datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous …

NICE 2023 Zero-shot Image Captioning Challenge

T Kim, P Ahn, S Kim, S Lee, M Marsden, A Sala… - arXiv preprint arXiv …, 2023 - arxiv.org
In this report, we introduce the NICE project (https://nice.lgresearch.ai/) and share
the results and outcomes of the NICE 2023 challenge. This project is designed to challenge the …

LEMoN: Label Error Detection using Multimodal Neighbors

H Zhang, A Balagopalan, N Oufattole, H Jeong… - arXiv preprint arXiv …, 2024 - arxiv.org
Large repositories of image-caption pairs are essential for the development of vision-
language models. However, these datasets are often extracted from noisy data scraped from …

NICE: CVPR 2023 challenge on zero-shot image captioning

T Kim, P Ahn, S Kim, S Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation)
project and share the results and outcomes of the 2023 challenge. This project is designed to …