Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition

P Zhang, X Dong, B Wang, Y Cao, C Xu, L Ouyang… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose InternLM-XComposer, a vision-language large model that enables advanced
image-text comprehension and composition. The innovative nature of our model is …

A survey on generative AI and LLMs for video generation, understanding, and streaming

P Zhou, L Wang, Z Liu, Y Hao, P Hui, S Tarkoma… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper offers an insightful examination of how currently top-trending AI technologies, i.e.,
generative artificial intelligence (Generative AI) and large language models (LLMs), are …

AVIS: Autonomous visual information seeking with large language model agent

Z Hu, A Iscen, C Sun, KW Chang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we propose an autonomous information seeking visual question answering
framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically …

A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models

W Fan, Y Ding, L Ning, S Wang, H Li, D Yin… - Proceedings of the 30th …, 2024 - dl.acm.org
As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can
offer reliable and up-to-date external knowledge, providing substantial benefits for numerous …

Can pre-trained vision and language models answer visual information-seeking questions?

Y Chen, H Hu, Y Luan, H Sun, S Changpinyo… - arXiv preprint arXiv …, 2023 - arxiv.org
Pre-trained vision and language models have demonstrated state-of-the-art capabilities over
existing tasks involving images and texts, including visual question answering. However, it …

Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models (LLMs) have become popular, an important trend has emerged of
using multimodality to augment LLMs' generation ability, which enables LLMs to better …

Sieve: Multimodal dataset pruning using image captioning models

A Mahmoud, M Elhoushi, A Abbas… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-
crawled datasets. This underscores the critical need for dataset pruning, as the quality of …