How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Mini-InternVL: A flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …

MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models

Z Liu, Y Zang, X Dong, P Zhang, Y Cao, H Duan… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual preference alignment involves training Large Vision-Language Models (LVLMs) to
predict human preferences between visual inputs. This is typically achieved by using …

POINTS: Improving your vision-language model with affordable strategies

Y Liu, Z Zhao, Z Zhuang, L Tian, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, vision-language models have made significant strides, excelling in tasks like
optical character recognition and geometric problem-solving. However, several critical …

OCRBench: On the hidden mystery of OCR in large multimodal models

Y Liu, Z Li, M Huang, B Yang, W Yu, C Li… - Science China …, 2024 - Springer
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

J Yang, D Yin, Y Zhou, F Rao, W Zhai, Y Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in multi-modal large language models have propelled the
development of joint probabilistic models capable of both image understanding and …

Number it: Temporal Grounding Videos like Flipping Manga

Y Wu, X Hu, Y Sun, Y Zhou, W Zhu, F Rao… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend this visual …

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

H Wang, C Ju, W Lin, S Xiao, M Chen, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream …