Qwen-VL: A frontier large vision-language model with versatile abilities

J Bai, S Bai, S Yang, S Wang, S Tan, P Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …

PaLI: A jointly-scaled multilingual language-image model

X Chen, X Wang, S Changpinyo… - arXiv preprint arXiv …, 2022 - arxiv.org
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …

Video-LLaVA: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
Large Vision-Language Models (LVLMs) have enhanced the performance of various
downstream tasks in vision-language understanding. Most existing approaches encode …

MiniGPT-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi …

F Liu, T Guan, Z Li, L Chen, Y Yacoob… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs), after being aligned with vision models and integrated into
vision-language models (VLMs), can bring impressive improvement in image reasoning …

TouchStone: Evaluating vision-language models by language models

S Bai, S Yang, J Bai, P Wang, X Zhang, J Lin… - arXiv preprint arXiv …, 2023 - arxiv.org
Large vision-language models (LVLMs) have recently witnessed rapid advancements,
exhibiting a remarkable capacity for perceiving, understanding, and processing visual …

DeepSeek-VL: Towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for
real-world vision and language understanding applications. Our approach is structured around …

Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training

G Li, N Duan, Y Fang, M Gong, D Jiang - Proceedings of the AAAI …, 2020 - aaai.org
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of
vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained …