Generative Visual Instruction Tuning

J Hernandez, R Villegas, V Ordonez - arXiv preprint arXiv:2406.11262, 2024 - arxiv.org
We propose to use machine-generated instruction-following data to improve the zero-shot
capabilities of a large multimodal model with additional support for generative and image …

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data

Y Li, C Zhang, G Yu, Z Wang, B Fu, G Lin… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked
significant interest in the development of multimodal Large Language Models (LLMs). A …

Mimic-it: Multi-modal in-context instruction tuning

B Li, Y Zhang, L Chen, J Wang, F Pu, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
High-quality instructions and responses are essential for the zero-shot performance of large
language models on interactive natural language tasks. For interactive vision-language …

Behind the magic, merlim: Multi-modal evaluation benchmark for large image-language models

A Villa, JCL Alcázar, A Soto, B Ghanem - arXiv preprint arXiv:2312.02219, 2023 - arxiv.org
Large Vision and Language Models have enabled significant advances in fully supervised
and zero-shot vision tasks. These large pre-trained architectures serve as the baseline to …

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Z Tang, Z Yang, M Khademi, Y Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for …

Pandagpt: One model to instruction-follow them all

Y Su, T Lan, H Li, J Xu, Y Wang, D Cai - arXiv preprint arXiv:2305.16355, 2023 - arxiv.org
We present PandaGPT, an approach to emPower large lANguage moDels with visual and
Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can …

Guiding instruction-based image editing via multimodal large language models

TJ Fu, W Hu, X Du, WY Wang, Y Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction-based image editing improves the controllability and flexibility of image
manipulation via natural commands without elaborate descriptions or regional masks …

Kosmos-g: Generating images in context with multimodal large language models

X Pan, L Dong, S Huang, Z Peng, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I)
generation have made significant strides. However, the generation from generalized vision …

MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4

V Azizi, F Koochaki - arXiv preprint arXiv:2406.00971, 2024 - arxiv.org
Vision-Language Models (VLMs) have recently seen significant advancements through
integrating with Large Language Models (LLMs). The VLMs, which process image and text …