The (r) evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

Pivot: Iterative visual prompting elicits actionable knowledge for vlms

S Nasiriany, F Xia, W Yu, T Xiao, J Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have shown impressive capabilities across a variety of
tasks, from logical reasoning to visual understanding. This opens the door to richer …

BRAVE: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) are typically composed of a vision encoder, eg CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …

LocCa: Visual Pretraining with Location-aware Captioners

B Wan, M Tschannen, Y Xian, F Pavetic… - arXiv preprint arXiv …, 2024 - arxiv.org
Image captioning has been shown as an effective pretraining method similar to contrastive
pretraining. However, the incorporation of location-aware information into visual pretraining …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

J Yang, X Chen, N Madaan, M Iyengar, S Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
The integration of language and 3D perception is crucial for developing embodied agents
and robots that comprehend and interact with the physical world. While large language …