How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Mini-InternVL: A flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …

MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models

Z Liu, Y Zang, X Dong, P Zhang, Y Cao, H Duan… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual preference alignment involves training Large Vision-Language Models (LVLMs) to
predict human preferences between visual inputs. This is typically achieved by using …

POINTS: Improving your vision-language model with affordable strategies

Y Liu, Z Zhao, Z Zhuang, L Tian, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, vision-language models have made significant strides, excelling in tasks like
optical character recognition and geometric problem-solving. However, several critical …

OCRBench: On the hidden mystery of OCR in large multimodal models

Y Liu, Z Li, M Huang, B Yang, W Yu, C Li… - Science China …, 2024 - Springer
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

J Yang, D Yin, Y Zhou, F Rao, W Zhai, Y Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in multi-modal large language models have propelled the
development of joint probabilistic models capable of both image understanding and …

Number it: Temporal Grounding Videos like Flipping Manga

Y Wu, X Hu, Y Sun, Y Zhou, W Zhu, F Rao… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend this visual …

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

H Wang, C Ju, W Lin, S Xiao, M Chen, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream …