OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

T Zhang, X Li, H Fei, H Yuan, S Wu, S Ji… - arXiv preprint arXiv …, 2024 - arxiv.org
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

M Wu, X Cai, J Ji, J Li, O Huang, G Luo, H Fei… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we propose a training-free method to inject visual referring into Multimodal
Large Language Models (MLLMs) through learnable visual token optimization. We observe …

Visual Prompting in Multimodal Large Language Models: A Survey

J Wu, Z Zhang, Y Xia, X Li, Z Xia, A Chang, T Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) equip pre-trained large language models
(LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied …

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Y Liu, Z Ma, Z Qi, Y Wu, Y Shan, CW Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their
great potential in general-purpose video understanding. To verify the significance of these …

Multi-modal Generative AI: Multi-modal LLM, Diffusion and Beyond

H Chen, X Wang, Y Zhou, B Huang, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal generative AI has received increasing attention in both academia and industry.
Particularly, two dominant families of techniques are: i) The multi-modal large language …

LLMs Can Evolve Continually on Modality for X-Modal Reasoning

J Yu, H Xiong, L Zhang, H Diao, Y Zhuge… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have gained significant attention due to their
impressive capabilities in multimodal understanding. However, existing methods rely heavily …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model

K Han, Y Hu, M Qu, H Shi, Y Zhao, Y Wei - arXiv preprint arXiv:2412.00153, 2024 - arxiv.org
Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and
free-text segmentation, yet existing models still require predefined category prompts, limiting …

Separable Mixture of Low-Rank Adaptation for Continual Visual Instruction Tuning

Z Wang, C Che, Q Wang, Y Li, Z Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to
effectively handle a wide range of vision tasks by framing them as language-based …
