Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
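
PaliGemma's pairing of a SigLIP image encoder with a Gemma-2B decoder means it can be queried like an ordinary image-to-text model. The following is a minimal sketch, not taken from the paper itself, assuming the Hugging Face Transformers PaliGemma integration (transformers >= 4.41) and the publicly released "google/paligemma-3b-mix-224" checkpoint; the image path is hypothetical.

# Sketch only: assumes transformers>=4.41 with PaliGemma support and access
# to the gated google/paligemma-3b-mix-224 checkpoint on Hugging Face.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

image = Image.open("example.jpg")  # hypothetical input image
prompt = "caption en"              # mix checkpoints accept short task prompts
inputs = processor(text=prompt, images=image, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=30)

# Decode only the tokens generated after the (image + text) prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(generated[0][prompt_len:], skip_special_tokens=True))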

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its
progression has been hindered by challenges in comprehending fine-grained visual content …

MiniCPM-V: A GPT-4V level MLLM on your phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding

T Zhang, X Li, H Fei, H Yuan, S Wu, S Ji… - arXiv preprint arXiv …, 2024 - arxiv.org
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …

SEED-X: Multimodal models with unified multi-granularity comprehension and generation

Y Ge, S Zhao, J Zhu, Y Ge, K Yi, L Song, C Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of multimodal foundation models has demonstrated significant
progress in vision-language understanding and generation, e.g., our previous work SEED …

RLAIF-V: Aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness

T Yu, H Zhang, Y Yao, Y Dang, D Chen, X Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Learning from feedback reduces the hallucination of multimodal large language models
(MLLMs) by aligning them with human preferences. While traditional methods rely on labor …