See, Say, and Segment: Teaching LMMs to Overcome False Premises

TH Wu, G Biamby, D Chan, L Dunlap… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current open-source Large Multimodal Models (LMMs) excel at tasks such as
open-vocabulary language grounding and segmentation but can suffer under false premises …

A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

S Sarto, M Cornia, L Baraldi, R Cucchiara - European Conference on …, 2025 - Springer
Effectively aligning with human judgment when evaluating machine-generated image
captions represents a complex yet intriguing challenge. Existing evaluation metrics like …

A novel evaluation framework for image2text generation

JH Huang, H Zhu, Y Shen, S Rudinac… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating the quality of automatically generated image descriptions is challenging,
requiring metrics that capture various aspects such as grammaticality, coverage …

AutoAD III: The Prequel - Back to the Pixels

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generating Audio Description (AD) for movies is a challenging task that requires
fine-grained visual understanding and an awareness of the characters and their names …

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

N Moratelli, D Caffagni, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
The conventional training approach for image captioning involves pre-training a network
using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to …

MICap: A Unified Model for Identity-aware Movie Descriptions

H Raajesh, NR Desanur, Z Khan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Characters are an important aspect of any storyline and identifying and including them in
descriptions is necessary for story understanding. While previous work has largely ignored …

CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

TH Wu, JE Gonzalez, T Darrell, DM Chan - arXiv preprint arXiv …, 2024 - arxiv.org
The Automated Audio Captioning (AAC) task asks models to generate natural language
descriptions of an audio input. Evaluating these machine-generated audio captions is a …

Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

Y Gao, L Fischer, A Lintner, S Ebling - arXiv preprint arXiv:2410.08860, 2024 - arxiv.org
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind
persons and persons with visual impairments in accessing digital media content on …

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

S Sarto, N Moratelli, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in caption generation, existing evaluation metrics often fail
to capture the full quality or fine-grained details of captions. This is mainly due to their …