Generative Visual Instruction Tuning

J Hernandez, R Villegas, V Ordonez - arXiv preprint arXiv:2406.11262, 2024 - arxiv.org
We propose to use machine-generated instruction-following data to improve the zero-shot
capabilities of a large multimodal model with additional support for generative and image …

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data

Y Li, C Zhang, G Yu, Z Wang, B Fu, G Lin… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked
significant interest in the development of multimodal Large Language Models (LLMs). A …

Mimic-it: Multi-modal in-context instruction tuning

B Li, Y Zhang, L Chen, J Wang, F Pu, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
High-quality instructions and responses are essential for the zero-shot performance of large
language models on interactive natural language tasks. For interactive vision-language …

Behind the magic, merlim: Multi-modal evaluation benchmark for large image-language models

A Villa, JCL Alcázar, A Soto, B Ghanem - arXiv preprint arXiv:2312.02219, 2023 - arxiv.org
Large Vision and Language Models have enabled significant advances in fully supervised
and zero-shot vision tasks. These large pre-trained architectures serve as the baseline to …

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Z Tang, Z Yang, M Khademi, Y Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for …

Pandagpt: One model to instruction-follow them all

Y Su, T Lan, H Li, J Xu, Y Wang, D Cai - arXiv preprint arXiv:2305.16355, 2023 - arxiv.org
We present PandaGPT, an approach to emPower large lANguage moDels with visual and
Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can …

Guiding instruction-based image editing via multimodal large language models

TJ Fu, W Hu, X Du, WY Wang, Y Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction-based image editing improves the controllability and flexibility of image
manipulation via natural commands without elaborate descriptions or regional masks …

Kosmos-g: Generating images in context with multimodal large language models

X Pan, L Dong, S Huang, Z Peng, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I)
generation have made significant strides. However, the generation from generalized vision …

MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4

V Azizi, F Koochaki - arXiv preprint arXiv:2406.00971, 2024 - arxiv.org
Vision-Language Models (VLMs) have recently seen significant advancements through
integrating with Large Language Models (LLMs). The VLMs, which process image and text …