MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Z Zong, B Ma, D Shen, G Song, H Shao, D Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
As the key component in multimodal large language models (MLLMs), the capability of the
visual encoder greatly affects an MLLM's understanding of diverse image content. Although …

F-LMM: Grounding Frozen Large Multimodal Models

S Wu, S Jin, W Zhang, L Xu, W Liu, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Endowing Large Multimodal Models (LMMs) with visual grounding capability can
significantly enhance AI's understanding of the visual world and its interaction with …

Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

Q Zhou, R Zhou, Z Hu, P Lu, S Gao, Y Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Chain-of-Thought (CoT) prompting and related rationale-based work have
significantly improved the performance of Large Language Models (LLMs) on complex …