MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

The (R)Evolution of Multimodal Large Language Models: A Survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

How Easy Is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

Y Qian, H Zhang, Y Yang, Z Gan - arXiv preprint arXiv:2402.13220, 2024 - arxiv.org
The remarkable advancements in Multimodal Large Language Models (MLLMs) have not
rendered them immune to challenges, particularly in the context of handling deceptive …

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

C Ma, Y Jiang, J Wu, Z Yuan, X Qi - arXiv preprint arXiv:2404.13013, 2024 - arxiv.org
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-
grained visual perception ability. Beyond holistic image understanding, Groma is adept at …

SegPoint: Segment Any Point Cloud via Large Language Model

S He, H Ding, X Jiang, B Wen - arXiv preprint arXiv:2407.13761, 2024 - arxiv.org
Despite significant progress in 3D point cloud segmentation, existing methods primarily
address specific tasks and depend on explicit instructions to identify targets, lacking the …

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

LH Chen, S Lu, A Zeng, H Zhang, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
This study delves into the realm of multi-modality (i.e., video and motion modalities) human
behavior understanding by leveraging the powerful capabilities of Large Language Models …

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

X Zhao, X Li, H Duan, H Huang, Y Li, K Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have made significant strides in various visual
understanding tasks. However, the majority of these models are constrained to process low …

Grounded 3D-LLM with Referent Tokens

Y Chen, S Yang, H Huang, T Wang, R Lyu, R Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Prior studies on 3D scene understanding have primarily developed specialized models for
specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D …

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

H Zhang, H You, P Dufter, B Zhang, C Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
While Ferret seamlessly integrates regional understanding into the Large Language Model
(LLM) to facilitate its referring and grounding capability, it poses certain limitations …

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

K You, H Zhang, E Schoop, F Weers… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in multimodal large language models (MLLMs) have been
noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend …