A survey of GPT-3 family large language models including ChatGPT and GPT-4

KS Kalyan - Natural Language Processing Journal, 2024 - Elsevier
Large language models (LLMs) are a special class of pretrained language models (PLMs)
obtained by scaling model size, pretraining corpus and computation. LLMs, because of their …

How to bridge the gap between modalities: A comprehensive survey on multimodal large language models

S Song, X Li, S Li, S Zhao, J Yu, J Ma, X Mao… - arXiv preprint arXiv …, 2023 - arxiv.org
This review paper explores Multimodal Large Language Models (MLLMs), which integrate
Large Language Models (LLMs) like GPT-4 to handle multimodal data such as text and …

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
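
This entry describes having an LLM write a short Python program against a documented visual API and then executing that program to answer the query. Below is a minimal sketch of that pattern, assuming a hypothetical ImagePatch API and a stubbed llm_generate in place of a real code-generation model; it is not the paper's actual implementation.

```python
# Sketch of the code-generation-and-execution idea: the LLM is shown an API
# spec plus the query, emits Python, and the program is run to get the answer.
# API_DOC, llm_generate, and ImagePatch are all illustrative stand-ins.

API_DOC = '''
class ImagePatch:
    def find(self, name: str) -> list["ImagePatch"]: ...  # detect objects
    def exists(self, name: str) -> bool: ...
'''

def llm_generate(prompt: str) -> str:
    # Placeholder for a code-generation LLM call; returns a fixed program
    # here so the sketch runs end to end.
    return "answer = len(image.find('muffin'))"

class ImagePatch:
    def __init__(self, detections):
        self.detections = detections
    def find(self, name):
        return [ImagePatch([]) for d in self.detections if d == name]
    def exists(self, name):
        return name in self.detections

def answer_query(image, query):
    program = llm_generate(f"{API_DOC}\n# Query: {query}\n# Write code:")
    scope = {"image": image}
    exec(program, scope)  # execute the generated program against the API
    return scope["answer"]

print(answer_query(ImagePatch(["muffin", "muffin", "dog"]), "How many muffins?"))  # -> 2
```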

Prompting large language models with answer heuristics for knowledge-based visual question answering

Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
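
The technique named in the title is to feed candidate answers from a conventional VQA model into the LLM prompt as heuristics. A hedged sketch of that prompting step follows; vqa_top_k and call_llm are placeholders for a trained VQA model and a GPT-3-style completion call.

```python
# Sketch of answer-heuristic prompting: a VQA model's top-k candidates, with
# confidences, are written into the prompt so the LLM can pick or refine one.

def vqa_top_k(image, question, k=3):
    # Stand-in for a trained VQA model; returns (answer, confidence) pairs.
    return [("umbrella", 0.62), ("parasol", 0.21), ("tent", 0.08)][:k]

def build_prompt(context, question, candidates):
    cand_str = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Candidates: {cand_str}\n"
        "Answer:"
    )

def call_llm(prompt: str) -> str:
    return "umbrella"  # placeholder for an LLM completion

question = "What is the woman holding?"
candidates = vqa_top_k(image=None, question=question)
print(call_llm(build_prompt("A woman stands in the rain.", question, candidates)))
```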

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …
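
The approach here turns the image into pure text so a frozen, text-only LLM can answer. A toy sketch of that conversion is below, assuming stub functions caption_model, make_qa_pairs, and call_llm in place of the paper's models.

```python
# Sketch of the image-to-textual-prompt idea: captions describe the image,
# and synthetic QA pairs mined from those captions act as in-context
# exemplars for a frozen LLM. All three callables are illustrative stubs.

def caption_model(image):
    return ["a boy kicks a red ball on a grass field"]

def make_qa_pairs(captions):
    # Real systems extract answer candidates from captions and generate
    # questions for them; a fixed pair stands in here.
    return [("What color is the ball?", "red")]

def build_prompt(captions, qa_pairs, question):
    lines = [f"Context: {c}" for c in captions]
    lines += [f"Question: {q}\nAnswer: {a}" for q, a in qa_pairs]
    lines.append(f"Question: {question}\nAnswer:")
    return "\n".join(lines)

def call_llm(prompt):
    return "soccer"  # placeholder for a frozen-LLM completion

caps = caption_model(image=None)
print(build_prompt(caps, make_qa_pairs(caps), "What sport is being played?"))
```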

REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory

Z Hu, A Iscen, C Sun, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model
(REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve …
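
The core mechanism described is retrieving from a large precomputed memory of knowledge embeddings before generation. A toy version of that retrieval step follows; the encoder and memory contents are random stand-ins, not the paper's trained models.

```python
# Sketch of a memory-retrieval step: knowledge entries are pre-encoded into
# key vectors; a query embedding pulls the top-k values by inner product,
# which would then be fused into the generator's input.

import numpy as np

rng = np.random.default_rng(0)
memory_keys = rng.normal(size=(1000, 64))        # precomputed key embeddings
memory_values = [f"knowledge entry {i}" for i in range(1000)]

def encode_query(image, question):
    return rng.normal(size=64)                   # stand-in query encoder

def retrieve(query_vec, k=5):
    scores = memory_keys @ query_vec             # inner-product relevance
    top = np.argsort(scores)[-k:][::-1]
    return [(memory_values[i], float(scores[i])) for i in top]

q = encode_query(image=None, question="Who painted this?")
for value, score in retrieve(q):
    print(f"{score:+.2f}  {value}")
```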

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

W Lin, J Chen, J Mei, A Coca… - Advances in Neural …, 2023 - proceedings.neurips.cc
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
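
"Late interaction" in the title refers to scoring query and document at the token level rather than with single pooled vectors. A sketch of ColBERT-style MaxSim scoring, the pattern this line of work builds on, is below; the embeddings are random stand-ins for real multi-modal token features.

```python
# Sketch of late-interaction scoring: each query token embedding takes its
# maximum similarity over all document token embeddings, and the maxima are
# summed into one relevance score.

import numpy as np

def late_interaction_score(query_tokens, doc_tokens):
    # query_tokens: (nq, d), doc_tokens: (nd, d), both L2-normalized
    sims = query_tokens @ doc_tokens.T    # (nq, nd) token-pair similarities
    return sims.max(axis=1).sum()         # MaxSim per query token, summed

rng = np.random.default_rng(0)
def normed(shape):
    x = rng.normal(size=shape)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = normed((8, 32))                      # e.g. question + visual tokens
docs = [normed((40, 32)) for _ in range(3)]  # candidate knowledge passages
scores = [late_interaction_score(query, d) for d in docs]
print("best passage:", int(np.argmax(scores)), scores)
```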

AVIS: Autonomous visual information seeking with large language model agent

Z Hu, A Iscen, C Sun, KW Chang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we propose an autonomous information seeking visual question answering
framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically …
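
The entry describes an LLM agent that dynamically decides which tool to invoke next while seeking information. A minimal loop of that shape follows; the tool set and the planner policy are illustrative stubs, not the paper's prompted LLM.

```python
# Sketch of an agentic tool-use loop: at each step a planner (stubbed here)
# inspects the state and picks the next tool until it can answer.

def image_search(state):  return state | {"entity": "Eiffel Tower"}
def web_search(state):    return state | {"fact": "completed in 1889"}

TOOLS = {"image_search": image_search, "web_search": web_search}

def planner(state):
    # Stand-in for the LLM decision; a real agent would prompt an LLM with
    # the allowed actions and the results gathered so far.
    if "entity" not in state:
        return "image_search"
    if "fact" not in state:
        return "web_search"
    return "answer"

state = {"question": "When was the landmark in the photo completed?"}
while (action := planner(state)) != "answer":
    state = TOOLS[action](state)
print(state["fact"])
```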

PromptCap: Prompt-guided image captioning for VQA with GPT-3

Y Hu, H Hua, Z Yang, W Shi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) involves questions that require
world knowledge beyond the image to yield the correct answer. Large language models …
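
The pipeline named in the title conditions the captioner on the question so the caption keeps the details the question needs, then hands caption plus question to a text-only LLM such as GPT-3. A hedged sketch, with both model calls stubbed:

```python
# Sketch of prompt-guided captioning for VQA: caption the image with the
# question as guidance, then let a text-only LLM answer from the caption.
# prompt_guided_caption and call_llm are placeholders for real models.

def prompt_guided_caption(image, question):
    # A real captioner generates a description tailored to `question`.
    return "a jar of Nutella next to a slice of toast"

def call_llm(prompt):
    return "hazelnut"  # placeholder for a GPT-3-style completion

question = "What nut is this spread made from?"
caption = prompt_guided_caption(image=None, question=question)
print(call_llm(f"Caption: {caption}\nQuestion: {question}\nAnswer:"))
```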