VizWiz grand challenge: Answering visual questions from blind people

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com

Abstract Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Cited by 180 - Related articles - All 8 versions

[PDF] nowpublishers.com

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 162 - Related articles - All 7 versions

[PDF] neurips.cc

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc

Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Cited by 2969 - Related articles - All 15 versions

[PDF] arxiv.org

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org

Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Cited by 2376 - Related articles - All 4 versions

[PDF] thecvf.com

Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com

Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper we present the first systematic study to investigate the design …

Cited by 931 - Related articles - All 5 versions

[PDF] neurips.cc

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2023 - proceedings.neurips.cc

A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

Cited by 368 - Related articles - All 5 versions

[PDF] arxiv.org

Qwen-VL: A frontier large vision-language model with versatile abilities

J Bai, S Bai, S Yang, S Wang, S Tan, P Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …

Cited by 567 - Related articles - All 2 versions

[PDF] arxiv.org

PaLI: A jointly-scaled multilingual language-image model

X Chen, X Wang, S Changpinyo… - arXiv preprint arXiv …, 2022 - arxiv.org

Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …

Cited by 526 - Related articles - All 6 versions

[PDF] arxiv.org

MMBench: Is your multi-modal model an all-around player?

Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org

Large vision-language models have recently achieved remarkable progress, exhibiting
great perception and reasoning abilities concerning visual information. However, how to …

Cited by 385 - Related articles - All 3 versions

[PDF] arxiv.org

MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning

J Chen, D Zhu, X Shen, X Li, Z Liu, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org

Large language models have shown their remarkable capabilities as a general interface for
various language-related applications. Motivated by this, we target to build a unified …

Cited by 315 - Related articles - All 6 versions
