CrossVQA: Scalably generating benchmarks for systematically testing VQA generalization

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org

Alongside huge volumes of research on deep learning models in NLP in the recent years,
there has been much work on benchmark datasets needed to track modeling progress …

Cited by 179 - Related articles - All 6 versions

PaLI: A jointly-scaled multilingual language-image model

X Chen, X Wang, S Changpinyo… - arXiv preprint arXiv …, 2022 - arxiv.org

Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …

Cited by 489 - Related articles - All 6 versions

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …

Cited by 75 - Related articles - All 5 versions

All you may need for VQA are image captions

S Changpinyo, D Kukliansky, I Szpektor… - arXiv preprint arXiv …, 2022 - arxiv.org

Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but
has not enjoyed the same level of engagement in terms of data creation. In this paper, we …

Cited by 59 - Related articles - All 7 versions

Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning

Z Li, X Wang, E Stengel-Eskin… - Proceedings of the …, 2023 - openaccess.thecvf.com

Visual Question Answering (VQA) models often perform poorly on out-of-distribution
data and struggle on domain generalization. Due to the multi-modal nature of this task …

Cited by 28 - Related articles - All 8 versions

From images to textual prompts: Zero-shot VQA with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li, D Tao… - arXiv preprint arXiv …, 2022 - arxiv.org

Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …

Cited by 41 - Related articles - All 3 versions

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

A Suglia, I Konstas, O Lemon - Journal of Artificial Intelligence Research, 2024 - jair.org

In recent years, several machine learning models have been proposed. They are trained
with a language modelling objective on large-scale text-only data. With such pretraining …

Cited by 3 - Related articles - All 6 versions

CX-ToM: Counterfactual explanations with theory-of-mind for enhancing human trust in image recognition models

AR Akula, K Wang, C Liu, S Saba-Sadiya, H Lu… - Iscience, 2022 - cell.com

We propose CX-ToM, short for counterfactual explanations with theory-of-mind, a new
explainable AI (XAI) framework for explaining decisions made by a deep convolutional …

Cited by 46 - Related articles - All 9 versions

Reassessing evaluation practices in visual question answering: A case study on out-of-distribution generalization

A Agrawal, I Kajić, E Bugliarello, E Davoodi… - arXiv preprint arXiv …, 2022 - arxiv.org

Vision-and-language (V&L) models pretrained on large-scale multimodal data have
demonstrated strong performance on various tasks such as image captioning and visual …

Cited by 18 - Related articles - All 3 versions

Attention cannot be an explanation

AR Akula, SC Zhu - arXiv preprint arXiv:2201.11194, 2022 - arxiv.org

Attention based explanations (viz. saliency maps), by providing interpretability to black box
models such as deep neural networks, are assumed to improve human trust and reliance in …

Cited by 6 - Related articles - All 2 versions
