DocKylin: A large multimodal model for visual document understanding with efficient visual slimming

J Zhang, W Yang, S Lai, Z Xie, L Jin - arXiv preprint arXiv:2406.19101, 2024 - arxiv.org
Current multimodal large language models (MLLMs) face significant challenges in visual
document understanding (VDU) tasks due to the high resolution, dense text, and complex …

LLaVADI: What Matters For Multimodal Large Language Models Distillation

S Xu, X Li, H Yuan, L Qi, Y Tong, MH Yang - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge in Multimodal Large Language Models (MLLMs) has showcased their
remarkable potential for achieving generalized intelligence by integrating visual …