From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to the automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines, including the computer …

Positive-augmented contrastive learning for image and video captioning evaluation

S Sarto, M Barraco, M Cornia… - Proceedings of the …, 2023 - openaccess.thecvf.com
The CLIP model has been recently proven to be very effective for a variety of cross-modal
tasks, including the evaluation of captions generated from vision-and-language …

Fine-grained image captioning with CLIP reward

J Cho, S Yoon, A Kale, F Dernoncourt, T Bui… - arXiv preprint arXiv …, 2022 - arxiv.org
Modern image captioning models are usually trained with text similarity objectives. However,
since reference captions in public datasets often describe the most salient common objects …

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

S Sarto, M Cornia, L Baraldi, R Cucchiara - European Conference on …, 2024 - Springer
Effectively aligning with human judgment when evaluating machine-generated image
captions represents a complex yet intriguing challenge. Existing evaluation metrics like …

Test-time distribution normalization for contrastively learned visual-language models

Y Zhou, J Ren, F Li, R Zabih… - Advances in Neural …, 2024 - proceedings.neurips.cc
Advances in the field of visual-language contrastive learning have made it possible for many
downstream applications to be carried out efficiently and accurately by simply taking the dot …

UMIC: An unreferenced metric for image captioning via contrastive learning

H Lee, S Yoon, F Dernoncourt, T Bui, K Jung - arXiv preprint arXiv …, 2021 - arxiv.org
Despite the success of various text generation metrics such as BERTScore, it is still difficult
to evaluate image captions without enough reference captions due to the diversity of the …

EMScore: Evaluating video captioning via coarse-grained and fine-grained embedding matching

Y Shi, X Yang, H Xu, C Yuan, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Current metrics for video captioning are mostly based on the text-level comparison between
reference and candidate captions. However, they have some insuperable drawbacks, e.g., …

Mutual information divergence: A unified metric for multimodal generative models

JH Kim, Y Kim, J Lee, KM Yoo… - Advances in Neural …, 2022 - proceedings.neurips.cc
Text-to-image generation and image captioning have recently emerged as a new
experimental paradigm to assess machine intelligence. They predict continuous quantity …

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

Y Wada, K Kaneda, D Saito… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Establishing an automatic evaluation metric that closely aligns with human judgments is
essential for effectively developing image captioning models. Recent data-driven metrics …