Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Prompting large language models with answer heuristics for knowledge-based visual question answering

Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of multimodal foundation models,
tracing the field's evolution from specialist vision and vision-language models to general-purpose assistants …

An empirical study of GPT-3 for few-shot knowledge-based VQA

Z Yang, Z Gan, J Wang, X Hu, Y Lu, Z Liu… - Proceedings of the AAAI …, 2022 - ojs.aaai.org
Knowledge-based visual question answering (VQA) involves answering questions
that require external knowledge not present in the image. Existing methods first retrieve …
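The approach summarized above converts the image into text and then prompts a frozen LLM with in-context examples. A minimal sketch of the prompt construction follows; the caption, examples, and template are illustrative placeholders, not the paper's exact ones, and the actual LLM call is omitted.

```python
# Sketch of few-shot knowledge-based VQA via LLM prompting:
# the image is first converted to text (a caption), then the caption and
# question are packed into an in-context prompt for a frozen LLM.
# All strings here are illustrative stand-ins.

def build_vqa_prompt(caption, question, examples):
    """Assemble an in-context prompt from (caption, question, answer) shots."""
    header = "Please answer the question according to the context.\n\n"
    shots = ""
    for ex_caption, ex_question, ex_answer in examples:
        shots += (f"Context: {ex_caption}\n"
                  f"Q: {ex_question}\nA: {ex_answer}\n\n")
    query = f"Context: {caption}\nQ: {question}\nA:"
    return header + shots + query

examples = [
    ("a red double-decker bus on a city street",
     "In which country is this bus commonly seen?", "england"),
]
prompt = build_vqa_prompt(
    "a man holding an umbrella under dark clouds",
    "What weather is likely coming?",
    examples,
)
print(prompt)
```

The completed prompt would then be sent to the LLM, whose continuation after the final "A:" serves as the predicted answer.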

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …

REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory

Z Hu, A Iscen, C Sun, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model
(REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve …
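The retrieval step described above can be sketched as a nearest-neighbor lookup over a memory of knowledge-entry embeddings. This is only an illustration under simplifying assumptions: the embeddings below are random stand-ins, whereas REVEAL itself learns the encoder and memory end-to-end and fuses retrieved entries into generation.

```python
# Minimal sketch of retrieval from a knowledge memory: a query embedding
# is matched against memory rows by cosine similarity and the top-k
# entries are returned. Embeddings are random placeholders, not learned.
import numpy as np

def retrieve_top_k(query, memory, k=2):
    """Return indices of the k memory rows most similar to the query."""
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity per entry
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
memory = rng.normal(size=(100, 64))     # 100 knowledge entries, dim 64
query = memory[7] + 0.01 * rng.normal(size=64)  # query near entry 7
top = retrieve_top_k(query, memory, k=2)
print(top)
```

In a full system, the retrieved entries (text snippets, image-text pairs, or KG triples) would be concatenated with the query representation before decoding.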

KAT: A knowledge augmented transformer for vision-and-language

L Gui, B Wang, Q Huang, A Hauptmann, Y Bisk… - arXiv preprint arXiv …, 2021 - arxiv.org
The primary focus of recent work with large-scale transformers has been on optimizing the
amount of information packed into the model's parameters. In this work, we ask a different …

Language models are general-purpose interfaces

Y Hao, H Song, L Dong, S Huang, Z Chi… - arXiv preprint arXiv …, 2022 - arxiv.org
Foundation models have received much attention due to their effectiveness across a broad
range of downstream applications. Though there is a big convergence in terms of …

PromptCap: Prompt-guided task-aware image captioning

Y Hu, H Hua, Z Yang, W Shi, NA Smith… - arXiv preprint arXiv …, 2022 - arxiv.org
Knowledge-based visual question answering (VQA) involves questions that require world
knowledge beyond the image to yield the correct answer. Large language models (LMs) like …