Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

An analysis of graph convolutional networks and recent datasets for visual question answering

AA Yusuf, F Chong, M Xianling - Artificial Intelligence Review, 2022 - Springer
Graph neural networks are a deep learning approach that has recently been widely applied to both structural and non-structural scenarios due to their strong performance and interpretability. In a non …

Unifying large language models and knowledge graphs: A roadmap

S Pan, L Luo, Y Wang, C Chen… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Large language models (LLMs), such as ChatGPT and GPT-4, are making new waves in the
field of natural language processing and artificial intelligence, due to their emergent ability …

A-OKVQA: A benchmark for visual question answering using world knowledge

D Schwenk, A Khandelwal, C Clark, K Marino… - European conference on …, 2022 - Springer
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-
form text-image composition and comprehension. This model goes beyond conventional …

Multi-modal knowledge graph construction and application: A survey

X Zhu, Z Li, X Wang, X Jiang, P Sun… - … on Knowledge and …, 2022 - ieeexplore.ieee.org
Recent years have witnessed a resurgence of knowledge engineering, marked by the fast
growth of knowledge graphs. However, most existing knowledge graphs are …

REVIVE: Regional visual representation matters in knowledge-based visual question answering

Y Lin, Y Xie, D Chen, Y Xu, C Zhu… - Advances in Neural …, 2022 - proceedings.neurips.cc
This paper revisits visual representation in knowledge-based visual question answering
(VQA) and demonstrates that using regional information in a better way can significantly …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Transform-Retrieve-Generate: Natural language-centric outside-knowledge visual question answering

F Gao, Q Ping, G Thattai, A Reganti… - Proceedings of the …, 2022 - openaccess.thecvf.com
Outside-knowledge visual question answering (OK-VQA) requires the agent to comprehend
the image, make use of relevant knowledge from the entire web, and digest all the …