MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

MM1: Methods, analysis & insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

The (R)Evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

Are We on the Right Way for Evaluating Large Vision-Language Models?

L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention

R Zhang, J Han, C Liu, A Zhou, P Lu… - The Twelfth …, 2024 - openreview.net
With the rising tide of large language models (LLMs), there has been a growing interest in
developing general-purpose instruction-following models, e.g., ChatGPT. To this end, we …

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?

R Zhang, D Jiang, Y Zhang, H Lin, Z Guo, P Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered
unparalleled attention, due to their superior performance in visual contexts. However, their …

LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models

F Li, R Zhang, H Zhang, Y Zhang, B Li, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual instruction tuning has made considerable strides in enhancing the capabilities of
Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single …

CuMo: Scaling multimodal LLM with co-upcycled mixture-of-experts

J Li, X Wang, S Zhu, CW Kuo, L Xu, F Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Multimodal Large Language Models (MLLMs) have focused
primarily on scaling by increasing text-image pair data and enhancing LLMs to improve …