InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its
progression has been hindered by challenges in comprehending fine-grained visual content …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

RLAIF-V: Aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness

T Yu, H Zhang, Y Yao, Y Dang, D Chen, X Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Learning from feedback reduces the hallucination of multimodal large language models
(MLLMs) by aligning them with human preferences. While traditional methods rely on labor …

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang, Y Zang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

X Zhao, X Li, H Duan, H Huang, Y Li, K Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have made significant strides in various visual
understanding tasks. However, the majority of these models are constrained to process low …

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

MTR Laskar, S Alqahtani, MS Bari, M Rahman… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently gained significant attention due to their
remarkable capabilities in performing diverse tasks across various domains. However, a …

Parrot: Multilingual Visual Instruction Tuning

HL Sun, DW Zhou, Y Li, S Lu, C Yi, QG Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has
marked a significant step towards artificial general intelligence. Existing methods mainly …

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

W Shi, Z Hu, Y Bin, J Liu, Y Yang, SK Ng, L Bing… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated impressive reasoning capabilities,
particularly in textual mathematical problem-solving. However, existing open-source image …

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

BK Lee, CW Kim, B Park, YM Ro - arXiv preprint arXiv:2405.15574, 2024 - arxiv.org
The rapid development of large language and vision models (LLVMs) has been driven by
advances in visual instruction tuning. Recently, open-source LLVMs have curated high …

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

X Xiao, B Wu, J Wang, C Li, X Zhou, H Guo - arXiv preprint arXiv …, 2024 - arxiv.org
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text
token equally in an autoregressive manner. Despite being simple and effective, this method …