MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration

Q Ye, H Xu, J Ye, M Yan, A Hu, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous methods have …

ShareGPT4V: Improving large multi-modal models with better captions

L Chen, J Li, X Dong, P Zhang, C He, J Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet
often constrained by the scarcity of high-quality image-text data. To address this bottleneck …

InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition

P Zhang, X Dong, B Wang, Y Cao, C Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose InternLM-XComposer, a vision-language large model that enables advanced
image-text comprehension and composition. The innovative nature of our model is …

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-
form text-image composition and comprehension. This model goes beyond conventional …

TOPIQ: A top-down approach from semantics to distortions for image quality assessment

C Chen, J Mo, J Hou, H Wu, L Liao… - … on Image Processing, 2024 - ieeexplore.ieee.org
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed
remarkable progress with deep neural networks. Inspired by the characteristics of the human …

Are We on the Right Way for Evaluating Large Vision-Language Models?

L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …

InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its
progression has been hindered by challenges in comprehending fine-grained visual content …

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models

F Li, R Zhang, H Zhang, Y Zhang, B Li, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual instruction tuning has made considerable strides in enhancing the capabilities of
Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single …