查看文章

arxiv.org 中的 [PDF]

Qwen-vl: A frontier large vision-language model with versatile abilities

作者

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

发表日期

2023/8/24

期刊

arXiv preprint arXiv:2308.12966

简介

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.

引用总数

被引用次数：523

2023202463 458

学术搜索中的文章

Qwen-vl: A frontier large vision-language model with versatile abilities

J Bai, S Bai, S Yang, S Wang, S Tan, P Wang, J Lin… - arXiv preprint arXiv:2308.12966, 2023

被引用次数：523 相关文章所有 2 个版本