In this paper, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial …
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the …
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices …
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA …
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing …
People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path …
Most Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary, i.e., CLIP, for common vision tasks. However, for some special tasks that need dense and fine …
We propose Universal Document Processing (UDOP), a foundation Document AI model that unifies text, image, and layout modalities together with varied task formats …