Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

S Tong, Z Liu, Y Zhai, Y Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the …

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Z Lin, C Liu, R Zhang, P Gao, L Qiu, H Xiao… - arXiv preprint arXiv …, 2023 - arxiv.org
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint
mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision …

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

G Chen, L Shen, R Shao, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs …

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

C Jiang, H Xu, M Dong, J Chen, W Ye… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still …

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

J Jain, J Yang, H Shi - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Humans possess the remarkable skill of Visual Perception: the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large …

Honeybee: Locality-Enhanced Projector for Multimodal LLM

J Cha, W Kang, J Mun, B Roh - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual …
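
As background for this snippet, below is a minimal sketch of what such a visual projector does: it maps features from a pre-trained vision encoder into the LLM's token-embedding space. The MLP design and the dimensions are illustrative assumptions, not Honeybee's actual locality-enhanced projector.

```python
import torch
import torch.nn as nn

# Hypothetical baseline projector: maps patch features from a frozen vision
# encoder (e.g. 1024-dim) into the LLM's token-embedding space (e.g. 4096-dim)
# so image patches can be fed to the LLM as "soft" visual tokens. This is NOT
# Honeybee's locality-enhanced design, just the common MLP baseline such
# projectors are compared against.
class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_feats)  # -> (batch, num_patches, llm_dim)

# Usage: projected visual tokens are concatenated with text embeddings
# before being passed to the LLM.
projector = MLPProjector()
visual_tokens = projector(torch.randn(1, 576, 1024))  # e.g. a 24x24 patch grid
```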

What Makes for Good Visual Tokenizers for Large Language Models?

G Wang, Y Ge, X Ding, M Kankanhalli… - arXiv preprint arXiv …, 2023 - arxiv.org
We empirically investigate proper pre-training methods to build good visual tokenizers,
making Large Language Models (LLMs) powerful Multimodal Large Language Models …

Efficient Multimodal Learning from Data-Centric Perspective

M He, Y Liu, B Wu, J Yuan, Y Wang, T Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in
general visual understanding and reasoning tasks. However, their deployment is hindered …

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

X Wang, Y Zhou, X Liu, H Lu, Y Xu, F He… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a
variety of visual-language tasks. However, current MLLM benchmarks are predominantly …