Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

MIMIC-IT: Multi-modal in-context instruction tuning

B Li, Y Zhang, L Chen, J Wang, F Pu, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
High-quality instructions and responses are essential for the zero-shot performance of large
language models on interactive natural language tasks. For interactive vision-language …

InstructBLIP: Towards general-purpose vision-language models with instruction tuning

W Dai, J Li, D Li, AMH Tiong, J Zhao… - Advances in …, 2024 - proceedings.neurips.cc
Large-scale pre-training and instruction tuning have been successful at creating general-
purpose language models with broad competence. However, building general-purpose …

VILA: On pre-training for visual language models

J Lin, H Yin, W Ping, P Molchanov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual language models (VLMs) have progressed rapidly with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …

Mixture of cluster-conditional LoRA experts for vision-language instruction tuning

Y Gou, Z Liu, K Chen, L Hong, H Xu, A Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction tuning of large vision-language models (LVLMs) has revolutionized the
development of versatile models with zero-shot generalization across a wide range of …
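
The cluster-conditional LoRA-expert idea can be pictured with a short sketch: a frozen base layer plus several low-rank adapters, with one adapter selected per instruction cluster. This is a minimal illustration of the general technique only; the class names, shapes, and the integer routing signal below are assumptions, not the paper's implementation.

```python
# Minimal sketch: frozen linear layer + per-cluster LoRA experts.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """One low-rank adapter: adds (alpha/r) * x A^T B^T to the base output."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero-init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.A.T @ self.B.T) * self.scale


class ClusterConditionalLoRALinear(nn.Module):
    """Frozen base linear layer plus one LoRA expert per instruction cluster."""

    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapters are trained
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, rank)
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, cluster_id: int) -> torch.Tensor:
        # cluster_id would come from clustering the instruction embedding upstream
        return self.base(x) + self.experts[cluster_id](x)


layer = ClusterConditionalLoRALinear(nn.Linear(512, 512), n_experts=4)
out = layer(torch.randn(2, 512), cluster_id=1)
```

In this reading, the router is just a hard assignment from an upstream instruction-clustering step; a learned soft router over experts would be a natural variant.
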

LLaMA-Adapter V2: Parameter-efficient visual instruction model

P Gao, J Han, R Zhang, Z Lin, S Geng, A Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
How to efficiently transform large language models (LLMs) into instruction followers has
recently become a popular research direction, while training LLMs for multi-modal reasoning remains …

MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning

Z Xu, Y Shen, L Huang - arXiv preprint arXiv:2212.10773, 2022 - arxiv.org
Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on
tasks specified through instructions, has shown promising zero-shot performance on various …

VisIT-Bench: A benchmark for vision-language instruction following inspired by real-world use

Y Bitton, H Bansal, J Hessel, R Shao, W Zhu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of
instruction-following vision-language models for real-world use. Our starting point is curating …

InstructionGPT-4: A 200-instruction paradigm for fine-tuning MiniGPT-4

L Wei, Z Jiang, W Huang, L Sun - arXiv preprint arXiv:2308.12067, 2023 - arxiv.org
Multimodal large language models acquire their instruction-following capabilities through a
two-stage training process: pre-training on image-text pairs and fine-tuning on supervised …

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …
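
The images-to-textual-prompts idea lends itself to a short sketch: describe the image in text with an off-the-shelf captioner, then ask a frozen LLM the question over that description. This is a hedged illustration of the general pattern, not the paper's exact pipeline; the model choices and the prompt template below are assumptions (gpt2 stands in for a large frozen LLM).

```python
# Minimal sketch of zero-shot VQA via textual prompts for a frozen LLM.
# Model names and the prompt template are illustrative assumptions.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="gpt2")  # stand-in for a large frozen LLM


def zero_shot_vqa(image_path: str, question: str) -> str:
    # Turn the image into text, then pose the question purely in language.
    caption = captioner(image_path)[0]["generated_text"]
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    out = llm(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()


print(zero_shot_vqa("photo.jpg", "What is the person holding?"))
```

The appeal of this pattern is that the LLM stays entirely frozen: all visual grounding is pushed into the text interface, so no multimodal fine-tuning is needed.
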