Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

S Tong, Z Liu, Y Zhai, Y Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the …

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Z Lin, C Liu, R Zhang, P Gao, L Qiu, H Xiao… - arXiv preprint arXiv …, 2023 - arxiv.org
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint
mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision …

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

G Chen, L Shen, R Shao, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs …

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

C Jiang, H Xu, M Dong, J Chen, W Ye… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still …

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

J Jain, J Yang, H Shi - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Humans possess the remarkable skill of Visual Perception: the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large …

Honeybee: Locality-Enhanced Projector for Multimodal LLM

J Cha, W Kang, J Mun, B Roh - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual …
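
As background for this snippet, below is a minimal sketch of what such a visual projector does: it maps features from a pre-trained vision encoder into the LLM's token-embedding space. The MLP design and the dimensions are illustrative assumptions, not Honeybee's actual locality-enhanced projector.

```python
import torch
import torch.nn as nn

# Hypothetical baseline projector: maps patch features from a frozen vision
# encoder (e.g. 1024-dim) into the LLM's token-embedding space (e.g. 4096-dim)
# so image patches can be fed to the LLM as "soft" visual tokens. This is NOT
# Honeybee's locality-enhanced design, just the common MLP baseline such
# projectors are compared against.
class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_feats)  # -> (batch, num_patches, llm_dim)

# Usage: projected visual tokens are concatenated with text embeddings
# before being passed to the LLM.
projector = MLPProjector()
visual_tokens = projector(torch.randn(1, 576, 1024))  # e.g. a 24x24 patch grid
```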

What Makes for Good Visual Tokenizers for Large Language Models?

G Wang, Y Ge, X Ding, M Kankanhalli… - arXiv preprint arXiv …, 2023 - arxiv.org
We empirically investigate proper pre-training methods to build good visual tokenizers,
making Large Language Models (LLMs) powerful Multimodal Large Language Models …

Efficient Multimodal Learning from Data-Centric Perspective

M He, Y Liu, B Wu, J Yuan, Y Wang, T Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in
general visual understanding and reasoning tasks. However, their deployment is hindered …

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

X Wang, Y Zhou, X Liu, H Lu, Y Xu, F He… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a
variety of visual-language tasks. However, current MLLM benchmarks are predominantly …