Cross-modal text and visual generation: A systematic review. Part 1: Image to text

M Żelaszczyk, J Mańdziuk - Information Fusion, 2023 - Elsevier
We review the existing literature on generating text from visual data under the cross-modal
generation umbrella, which allows us to compare and contrast various approaches taking …

UMIC: An unreferenced metric for image captioning via contrastive learning

H Lee, S Yoon, F Dernoncourt, T Bui, K Jung - arXiv preprint arXiv …, 2021 - arxiv.org
Despite the success of various text generation metrics such as BERTScore, it is still difficult
to evaluate image captions without enough reference captions due to the diversity of the …

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

Y Wada, K Kaneda, D Saito… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Establishing an automatic evaluation metric that closely aligns with human judgments is
essential for effectively developing image captioning models. Recent data-driven metrics …

Smurf: Semantic and linguistic understanding fusion for caption evaluation via typicality analysis

J Feinglass, Y Yang - arXiv preprint arXiv:2106.01444, 2021 - arxiv.org
The open-ended nature of visual captioning makes it a challenging area for evaluation. The
majority of proposed models rely on specialized training to improve human-correlation …

Ic3: Image captioning by committee consensus

DM Chan, A Myers, S Vijayanarasimhan… - arXiv preprint arXiv …, 2023 - arxiv.org
If you ask a human to describe an image, they might do so in a thousand different ways.
Traditionally, image captioning models are trained to generate a single "best" (most like a …

Towards an Exhaustive Evaluation of Vision-Language Foundation Models

E Salin, S Ayache, B Favre - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Vision-language foundation models have seen considerable performance improvements in the
last few years. However, there is still a lack of comprehensive evaluation methods able to …

A Survey of Image Captioning Based on Deep Learning

石义乐, 杨文忠, 杜慧祥, 王丽花, 王婷, 理珊珊 - 电子学报 (Acta Electronica Sinica), 2021 - ejournal.org.cn
Image captioning aims to extract image features, feed them into a language generation model, and output a corresponding description of the image,
addressing a problem at the intersection of natural language processing and computer vision in artificial intelligence: intelligent image understanding. This survey covers work from 2015 …

Validated image caption rating dataset

LD Narins, A Scott, A Gautam… - Advances in …, 2024 - proceedings.neurips.cc
We present a new high-quality validated image caption rating (VICR) dataset. How well a
caption fits an image can be difficult to assess due to the subjective nature of caption quality …

PraCegoVer: A Large Dataset for Image Captioning in Portuguese

GO dos Santos, EL Colombini, S Avila - Data, 2022 - mdpi.com
Automatically describing images using natural sentences is essential for the inclusion of visually
impaired people on the Internet. This problem is known as Image Captioning. There are …

Cross-modal language generation using pivot stabilization for web-scale language coverage

AV Thapliyal, R Soricut - arXiv preprint arXiv:2005.00246, 2020 - arxiv.org
The ability of cross-modal language generation tasks, such as image captioning, to support
non-English languages is directly hurt by the trend toward data-hungry models combined with …