MM-LLMs: Recent Advances in MultiModal Large Language Models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

J Liu, Z Wang, Q Ye, D Chong, P Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have introduced a new era of proficiency in comprehending
complex healthcare and biomedical topics. However, there is a noticeable lack of models in …

Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

L Qin, Q Chen, Y Zhou, Z Chen, Y Li, L Liao… - arXiv preprint arXiv …, 2024 - arxiv.org
Multilingual Large Language Models harness powerful Large Language Models to handle
and respond to queries in multiple languages, achieving remarkable …

From Large Language Models to Large Multimodal Models: A Literature Review

D Huang, C Yan, Q Li, X Peng - Applied Sciences, 2024 - mdpi.com
With the deepening of research on Large Language Models (LLMs), significant progress has
been made in recent years on the development of Large Multimodal Models (LMMs), which …

Parrot: Multilingual Visual Instruction Tuning

HL Sun, DW Zhou, Y Li, S Lu, C Yi, QG Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has
marked a significant step towards artificial general intelligence. Existing methods mainly …

VCR: Visual Caption Restoration

T Zhang, S Wang, L Li, G Zhang, P Taslakian… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Visual Caption Restoration (VCR), a novel vision-language task that
challenges models to accurately restore partially obscured texts using pixel-level hints within …

Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions

Y Liu, Z Liang, Y Wang, M He, J Li, B Zhao - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have exhibited impressive capabilities in
visual understanding and reasoning, providing seemingly reasonable answers, such as image …

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

K Zhang, B Li, P Zhang, F Pu, JA Cahyono… - arXiv preprint arXiv …, 2024 - arxiv.org
Advances in large foundation models necessitate wide-coverage, low-cost, and
zero-contamination benchmarks. Despite continuous exploration of language model evaluations …

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Y Qian, H Ye, JP Fauconnier, P Grasch, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large
language models (MLLMs) on their ability to strictly adhere to complex instructions. Our …

A Single Transformer for Scalable Vision-Language Modeling

Y Chen, X Wang, H Peng, H Ji - arXiv preprint arXiv:2407.06438, 2024 - arxiv.org
We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current
large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous …