A survey of GPT-3 family large language models including ChatGPT and GPT-4

KS Kalyan - Natural Language Processing Journal, 2024 - Elsevier
Large language models (LLMs) are a special class of pretrained language models (PLMs)
obtained by scaling model size, pretraining corpus and computation. LLMs, because of their …

How to bridge the gap between modalities: A comprehensive survey on multimodal large language models

S Song, X Li, S Li, S Zhao, J Yu, J Ma, X Mao… - arXiv preprint arXiv …, 2023 - arxiv.org
This review paper explores Multimodal Large Language Models (MLLMs), which integrate
Large Language Models (LLMs) like GPT-4 to handle multimodal data such as text and …

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
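
This entry describes having an LLM write a short Python program against a documented visual API and then executing that program to answer the query. Below is a minimal sketch of that pattern, assuming a hypothetical ImagePatch API and a stubbed llm_generate in place of a real code-generation model; it is not the paper's actual implementation.

```python
# Sketch of the code-generation-and-execution idea: the LLM is shown an API
# spec plus the query, emits Python, and the program is run to get the answer.
# API_DOC, llm_generate, and ImagePatch are all illustrative stand-ins.

API_DOC = '''
class ImagePatch:
    def find(self, name: str) -> list["ImagePatch"]: ...  # detect objects
    def exists(self, name: str) -> bool: ...
'''

def llm_generate(prompt: str) -> str:
    # Placeholder for a code-generation LLM call; returns a fixed program
    # here so the sketch runs end to end.
    return "answer = len(image.find('muffin'))"

class ImagePatch:
    def __init__(self, detections):
        self.detections = detections
    def find(self, name):
        return [ImagePatch([]) for d in self.detections if d == name]
    def exists(self, name):
        return name in self.detections

def answer_query(image, query):
    program = llm_generate(f"{API_DOC}\n# Query: {query}\n# Write code:")
    scope = {"image": image}
    exec(program, scope)  # execute the generated program against the API
    return scope["answer"]

print(answer_query(ImagePatch(["muffin", "muffin", "dog"]), "How many muffins?"))  # -> 2
```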

Prompting large language models with answer heuristics for knowledge-based visual question answering

Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
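
The technique named in the title is to feed candidate answers from a conventional VQA model into the LLM prompt as heuristics. A hedged sketch of that prompting step follows; vqa_top_k and call_llm are placeholders for a trained VQA model and a GPT-3-style completion call.

```python
# Sketch of answer-heuristic prompting: a VQA model's top-k candidates, with
# confidences, are written into the prompt so the LLM can pick or refine one.

def vqa_top_k(image, question, k=3):
    # Stand-in for a trained VQA model; returns (answer, confidence) pairs.
    return [("umbrella", 0.62), ("parasol", 0.21), ("tent", 0.08)][:k]

def build_prompt(context, question, candidates):
    cand_str = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Candidates: {cand_str}\n"
        "Answer:"
    )

def call_llm(prompt: str) -> str:
    return "umbrella"  # placeholder for an LLM completion

question = "What is the woman holding?"
candidates = vqa_top_k(image=None, question=question)
print(call_llm(build_prompt("A woman stands in the rain.", question, candidates)))
```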

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …
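
The approach here turns the image into pure text so a frozen, text-only LLM can answer. A toy sketch of that conversion is below, assuming stub functions caption_model, make_qa_pairs, and call_llm in place of the paper's models.

```python
# Sketch of the image-to-textual-prompt idea: captions describe the image,
# and synthetic QA pairs mined from those captions act as in-context
# exemplars for a frozen LLM. All three callables are illustrative stubs.

def caption_model(image):
    return ["a boy kicks a red ball on a grass field"]

def make_qa_pairs(captions):
    # Real systems extract answer candidates from captions and generate
    # questions for them; a fixed pair stands in here.
    return [("What color is the ball?", "red")]

def build_prompt(captions, qa_pairs, question):
    lines = [f"Context: {c}" for c in captions]
    lines += [f"Question: {q}\nAnswer: {a}" for q, a in qa_pairs]
    lines.append(f"Question: {question}\nAnswer:")
    return "\n".join(lines)

def call_llm(prompt):
    return "soccer"  # placeholder for a frozen-LLM completion

caps = caption_model(image=None)
print(build_prompt(caps, make_qa_pairs(caps), "What sport is being played?"))
```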

REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory

Z Hu, A Iscen, C Sun, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model
(REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve …
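
The core mechanism described is retrieving from a large precomputed memory of knowledge embeddings before generation. A toy version of that retrieval step follows; the encoder and memory contents are random stand-ins, not the paper's trained models.

```python
# Sketch of a memory-retrieval step: knowledge entries are pre-encoded into
# key vectors; a query embedding pulls the top-k values by inner product,
# which would then be fused into the generator's input.

import numpy as np

rng = np.random.default_rng(0)
memory_keys = rng.normal(size=(1000, 64))        # precomputed key embeddings
memory_values = [f"knowledge entry {i}" for i in range(1000)]

def encode_query(image, question):
    return rng.normal(size=64)                   # stand-in query encoder

def retrieve(query_vec, k=5):
    scores = memory_keys @ query_vec             # inner-product relevance
    top = np.argsort(scores)[-k:][::-1]
    return [(memory_values[i], float(scores[i])) for i in top]

q = encode_query(image=None, question="Who painted this?")
for value, score in retrieve(q):
    print(f"{score:+.2f}  {value}")
```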

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

W Lin, J Chen, J Mei, A Coca… - Advances in Neural …, 2023 - proceedings.neurips.cc
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
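
"Late interaction" in the title refers to scoring query and document at the token level rather than with single pooled vectors. A sketch of ColBERT-style MaxSim scoring, the pattern this line of work builds on, is below; the embeddings are random stand-ins for real multi-modal token features.

```python
# Sketch of late-interaction scoring: each query token embedding takes its
# maximum similarity over all document token embeddings, and the maxima are
# summed into one relevance score.

import numpy as np

def late_interaction_score(query_tokens, doc_tokens):
    # query_tokens: (nq, d), doc_tokens: (nd, d), both L2-normalized
    sims = query_tokens @ doc_tokens.T    # (nq, nd) token-pair similarities
    return sims.max(axis=1).sum()         # MaxSim per query token, summed

rng = np.random.default_rng(0)
def normed(shape):
    x = rng.normal(size=shape)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = normed((8, 32))                      # e.g. question + visual tokens
docs = [normed((40, 32)) for _ in range(3)]  # candidate knowledge passages
scores = [late_interaction_score(query, d) for d in docs]
print("best passage:", int(np.argmax(scores)), scores)
```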

AVIS: Autonomous visual information seeking with large language model agent

Z Hu, A Iscen, C Sun, KW Chang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we propose an autonomous information seeking visual question answering
framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically …
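
The entry describes an LLM agent that dynamically decides which tool to invoke next while seeking information. A minimal loop of that shape follows; the tool set and the planner policy are illustrative stubs, not the paper's prompted LLM.

```python
# Sketch of an agentic tool-use loop: at each step a planner (stubbed here)
# inspects the state and picks the next tool until it can answer.

def image_search(state):  return state | {"entity": "Eiffel Tower"}
def web_search(state):    return state | {"fact": "completed in 1889"}

TOOLS = {"image_search": image_search, "web_search": web_search}

def planner(state):
    # Stand-in for the LLM decision; a real agent would prompt an LLM with
    # the allowed actions and the results gathered so far.
    if "entity" not in state:
        return "image_search"
    if "fact" not in state:
        return "web_search"
    return "answer"

state = {"question": "When was the landmark in the photo completed?"}
while (action := planner(state)) != "answer":
    state = TOOLS[action](state)
print(state["fact"])
```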

PromptCap: Prompt-guided image captioning for VQA with GPT-3

Y Hu, H Hua, Z Yang, W Shi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) involves questions that require
world knowledge beyond the image to yield the correct answer. Large language models …
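
The pipeline named in the title conditions the captioner on the question so the caption keeps the details the question needs, then hands caption plus question to a text-only LLM such as GPT-3. A hedged sketch, with both model calls stubbed:

```python
# Sketch of prompt-guided captioning for VQA: caption the image with the
# question as guidance, then let a text-only LLM answer from the caption.
# prompt_guided_caption and call_llm are placeholders for real models.

def prompt_guided_caption(image, question):
    # A real captioner generates a description tailored to `question`.
    return "a jar of Nutella next to a slice of toast"

def call_llm(prompt):
    return "hazelnut"  # placeholder for a GPT-3-style completion

question = "What nut is this spread made from?"
caption = prompt_guided_caption(image=None, question=question)
print(call_llm(f"Caption: {caption}\nQuestion: {question}\nAnswer:"))
```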