YK Zhang, S Lu, Y Li, Y Ma, QG Chen, Z Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs), initiated with a trained LLM, first align images
with text and then fine-tune on multimodal mixed inputs. However, the MLLM …