Object hallucination in image captioning

A Rohrbach, LA Hendricks, K Burns, T Darrell… - arXiv preprint arXiv …, 2018 - arxiv.org
Despite continuously improving performance, contemporary image captioning models are
prone to "hallucinating" objects that are not actually in a scene. One problem is that standard …
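
This paper quantifies hallucination with the CHAIR metrics. A minimal sketch of the per-instance variant is below; it is a hypothetical simplification (the published implementation maps caption words to MSCOCO object categories via synonym lists, which is omitted here).

```python
def chair_i(caption_objects, ground_truth_objects):
    """CHAIR-i sketch: fraction of objects mentioned in a caption
    that are not actually present in the image.

    caption_objects: object words extracted from the generated caption
    ground_truth_objects: objects annotated as present in the image
    """
    mentioned = set(caption_objects)
    if not mentioned:
        return 0.0  # caption mentions no objects, so nothing is hallucinated
    hallucinated = mentioned - set(ground_truth_objects)
    return len(hallucinated) / len(mentioned)


# Example: "a clock on the beach next to a person" against an image
# annotated with only {beach, person} — the clock is hallucinated.
score = chair_i(["clock", "beach", "person"], ["beach", "person"])
```

A lower score is better; 0.0 means every mentioned object is grounded in the annotations.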

Let there be a clock on the beach: Reducing object hallucination in image captioning

AF Biten, L Gómez, D Karatzas - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Explaining an image with missing or non-existent objects is known as object bias
(hallucination) in image captioning. This behaviour is quite common in the state-of-the-art …

Describing like humans: on diversity in image captioning

Q Wang, AB Chan - … of the IEEE/CVF Conference on …, 2019 - openaccess.thecvf.com
Recently, the state-of-the-art models for image captioning have overtaken human
performance based on the most popular metrics, such as BLEU, METEOR, ROUGE and …

Compare and reweight: Distinctive image captioning using similar images sets

J Wang, W Xu, Q Wang, AB Chan - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer
A wide range of image captioning models has been developed, achieving significant
improvement based on popular metrics, such as BLEU, CIDEr, and SPICE. However …

Image captioning: Transforming objects into words

S Herdade, A Kappeler, K Boakye… - Advances in neural …, 2019 - proceedings.neurips.cc
Image captioning models typically follow an encoder-decoder architecture which uses
abstract image feature vectors as input to the encoder. One of the most successful …

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

S Sarto, M Cornia, L Baraldi, R Cucchiara - European Conference on …, 2025 - Springer
Effectively aligning with human judgment when evaluating machine-generated image
captions represents a complex yet intriguing challenge. Existing evaluation metrics like …

Fusecap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

Consensus graph representation learning for better grounded image captioning

W Zhang, H Shi, S Tang, J Xiao, Q Yu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
The contemporary visual captioning models frequently hallucinate objects that are not
actually in a scene, due to the visual misclassification or over-reliance on priors that …

Nocaps: Novel object captioning at scale

H Agrawal, K Desai, Y Wang, X Chen… - Proceedings of the …, 2019 - openaccess.thecvf.com
Image captioning models have achieved impressive results on datasets containing limited
visual concepts and large amounts of paired image-caption training data. However, if these …

Fine-grained image captioning with clip reward

J Cho, S Yoon, A Kale, F Dernoncourt, T Bui… - arXiv preprint arXiv …, 2022 - arxiv.org
Modern image captioning models are usually trained with text similarity objectives. However,
since reference captions in public datasets often describe the most salient common objects …
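
The reward proposed here scores a generated caption by its similarity to the image in a CLIP-style joint embedding space, rather than by text overlap with references. A minimal sketch, assuming image and caption embeddings have already been produced by a CLIP-like dual encoder (loading an actual model is omitted):

```python
import numpy as np

def clip_style_reward(image_emb, caption_emb):
    """Cosine similarity between an image embedding and a caption
    embedding, used as a per-caption reward signal (sketch; the
    embeddings are assumed to come from a CLIP-like dual encoder)."""
    image_emb = np.asarray(image_emb, dtype=float)
    caption_emb = np.asarray(caption_emb, dtype=float)
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    return float(image_emb @ caption_emb)
```

During RL fine-tuning, this score would replace a reference-based metric such as CIDEr as the sequence-level reward, so captions are pushed toward visual fidelity rather than toward the most common reference phrasing.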