From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …
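
A minimal sketch of the caption-then-prompt idea this line of work builds on: describe the image with an off-the-shelf captioner, then hand the caption and the question to a frozen LLM. Model names and the prompt template below are illustrative, and the paper's full pipeline additionally synthesizes exemplar question-answer pairs from the image, so this is a simplification, not the authors' method.

```python
# Hedged sketch of caption-then-prompt zero-shot VQA (a simplification of
# the paper's pipeline; model names and prompt template are illustrative).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="facebook/opt-1.3b", max_new_tokens=20)

def zero_shot_vqa(image_path: str, question: str) -> str:
    caption = captioner(image_path)[0]["generated_text"]
    prompt = f"Context: {caption}\nQuestion: {question}\nShort answer:"
    out = llm(prompt)[0]["generated_text"]
    return out[len(prompt):].strip()  # keep only the newly generated answer

print(zero_shot_vqa("kitchen.jpg", "What is on the counter?"))
```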

All you may need for VQA are image captions

S Changpinyo, D Kukliansky, I Szpektor… - arXiv preprint arXiv …, 2022 - arxiv.org
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but
has not enjoyed the same level of engagement in terms of data creation. In this paper, we …
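
The recipe pointed at here is to mine question-answer pairs from existing captions. Below is a rough illustration using an answer-aware question-generation model; the checkpoint name is a hypothetical placeholder, and the paper's actual pipeline (with answer-candidate extraction and filtering) is more involved.

```python
# Rough sketch: synthesize VQA training pairs from image captions via
# answer-aware question generation. The checkpoint name is a placeholder
# for any T5 model fine-tuned for question generation.
from transformers import pipeline

qg = pipeline("text2text-generation", model="your-org/t5-question-generation")

def caption_to_qa(caption: str, answer_span: str) -> dict:
    # The model sees the caption plus a span from it to use as the answer.
    prompt = f"answer: {answer_span} context: {caption}"
    question = qg(prompt)[0]["generated_text"]
    return {"question": question, "answer": answer_span}

pair = caption_to_qa("A brown dog chases a red frisbee in the park.", "a red frisbee")
print(pair)  # e.g. {"question": "What does the dog chase?", "answer": "a red frisbee"}
```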

A thorough review of models, evaluation metrics, and datasets on image captioning

G Luo, L Cheng, C Jing, C Zhao… - IET Image Processing, 2022 - Wiley Online Library
Image captioning means automatically generating descriptive sentences from a query image.
It has recently received widespread attention from the computer vision and natural language …

Context matters for image descriptions for accessibility: Challenges for referenceless evaluation metrics

E Kreiss, C Bennett, S Hooshmand, E Zelikman… - arXiv preprint arXiv …, 2022 - arxiv.org
Few images on the Web receive alt-text descriptions that would make them accessible to
blind and low vision (BLV) users. Image-based NLG systems have progressed to the point …

MaXM: Towards multilingual visual question answering

S Changpinyo, L Xue, M Yarom, AV Thapliyal… - arXiv preprint arXiv …, 2022 - arxiv.org
Visual Question Answering (VQA) has been primarily studied through the lens of the English
language. Yet, tackling VQA in other languages in the same manner would require a …

Pre-training multi-modal dense retrievers for outside-knowledge visual question answering

A Salemi, M Rafiee, H Zamani - Proceedings of the 2023 ACM SIGIR …, 2023 - dl.acm.org
This paper studies a category of visual question answering tasks, in which accessing
external knowledge is necessary for answering the questions. This category is called …
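
For orientation, the dense-retrieval step can be illustrated with a generic bi-encoder: embed the query and a passage corpus, then rank passages by similarity. Representing the image by a caption, as below, is a simplification; the paper pre-trains a genuinely multi-modal retriever, and the encoder checkpoint here is only illustrative.

```python
# Bare-bones dense retrieval for outside-knowledge VQA. Using a text caption
# in place of the image is a simplification of the paper's multi-modal setup.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative bi-encoder

passages = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "Golden retrievers were bred in Scotland as gundogs.",
    "The Colosseum in Rome could hold roughly 50,000 spectators.",
]
passage_emb = encoder.encode(passages, convert_to_tensor=True)

def retrieve(question: str, image_caption: str, k: int = 2):
    query_emb = encoder.encode(f"{question} {image_caption}", convert_to_tensor=True)
    scores = util.cos_sim(query_emb, passage_emb)[0]  # similarity to each passage
    top = scores.topk(k)
    return [(passages[int(i)], float(s)) for s, i in zip(top.values, top.indices)]

print(retrieve("When was this landmark built?", "a large iron tower in Paris"))
```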

ZVQAF: Zero-shot visual question answering with feedback from large language models

C Liu, C Wang, Y Peng, Z Li - Neurocomputing, 2024 - Elsevier
Owing to the strong zero-shot generalization to new language tasks demonstrated by large
language models (LLMs), applying LLMs to zero-shot visual question answering (VQA) has …
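
The abstract points at using LLM feedback to improve zero-shot VQA. As a loose illustration of that general pattern (not ZVQAF's specific architecture or training objective), an LLM can be asked to critique and revise a candidate answer:

```python
# Generic LLM-feedback loop for zero-shot VQA answers. This shows the broad
# idea of an LLM critiquing candidates; it is not ZVQAF's actual method.
from transformers import pipeline

llm = pipeline("text-generation", model="facebook/opt-1.3b", max_new_tokens=30)

def refine_answer(caption: str, question: str, candidate: str) -> str:
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        f"Proposed answer: {candidate}\n"
        "If the proposed answer is consistent with the description, repeat it; "
        "otherwise give a better short answer.\nFinal answer:"
    )
    out = llm(prompt)[0]["generated_text"]
    return out[len(prompt):].strip()  # keep only the revised answer

print(refine_answer("two cats asleep on a sofa", "How many cats are there?", "three"))
```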

ContextRef: Evaluating Referenceless Metrics For Image Description Generation

E Kreiss, E Zelikman, C Potts, N Haber - arXiv preprint arXiv:2309.11710, 2023 - arxiv.org
Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess
image descriptions directly without costly ground-truth reference texts. Such methods can …
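
For context, a CLIPScore-style referenceless metric is simple to compute: embed the image and the candidate description with CLIP and rescale their cosine similarity (Hessel et al., 2021, use 2.5 * max(cos, 0)). This sketches the kind of metric being evaluated, not ContextRef itself.

```python
# Sketch of a CLIPScore-style referenceless metric: cosine similarity between
# CLIP image and text embeddings, rescaled as in Hessel et al. (2021).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, description: str) -> float:
    image = Image.open(image_path)
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = float((img * txt).sum())
    return 2.5 * max(cos, 0.0)  # CLIPScore rescaling

print(clip_score("photo.jpg", "a child flying a kite on a beach"))
```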

MaskEval: Weighted MLM-based evaluation for text summarization and simplification

YL Liu, R Bawden, T Scialom, B Sagot… - arXiv preprint arXiv …, 2022 - arxiv.org
In text summarization and simplification, system outputs must be evaluated along multiple
dimensions such as relevance, factual consistency, fluency, and grammaticality, and a wide …
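
The underlying MLM-scoring idea can be sketched quickly: mask each token of the candidate text in turn and average the masked LM's log-probability of the original token. MaskEval learns per-token weights on top of such scores; the uniform-weight, single-text version below is a simplification.

```python
# Sketch of MLM-based scoring: mask each token of a candidate text and
# average the MLM's log-probability of the original token. MaskEval learns
# per-token weights; here every token is weighted uniformly.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mlm_score(text: str) -> float:
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    log_probs = []
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        log_probs.append(torch.log_softmax(logits, dim=-1)[ids[i]].item())
    return sum(log_probs) / len(log_probs)

print(mlm_score("The summary is fluent and factually consistent."))
```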