Z Fu, K Song, L Zhou, Y Yang - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Image captioning aims to automatically generate captions for images by learning a cross-modal generator from vision to language. The large amount of image-text pairs required for …
Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g., diffusion models and large language models) …
We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language …
The success of existing cross-modal retrieval (CMR) methods relies heavily on the assumption that the annotated cross-modal correspondence is faultless. In practice, however, the …
T Takada, Y Suzuki, H Takushima… - Proceedings of the …, 2024 - aclanthology.org
While image captioning is an essential field of vision-language models (VLMs), a lack of continuity between the learning objective and final performance metrics of VLMs …
Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous …
In this report, we introduce the NICE project (https://nice.lgresearch.ai/) and share the results and outcomes of the NICE challenge 2023. This project is designed to challenge the …
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from …
T Kim, P Ahn, S Kim, S Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of the 2023 challenge. This project is designed to …