Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Large multimodal models (LMMs) have recently shown encouraging progress with visual
instruction tuning. In this paper, we present the first systematic study to investigate the design …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the transition …

CapsFusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

RemoteCLIP: A vision language foundation model for remote sensing

F Liu, D Chen, Z Guan, X Zhou, J Zhu… - … on Geoscience and …, 2024 - ieeexplore.ieee.org
General-purpose foundation models have led to recent breakthroughs in artificial
intelligence (AI). In remote sensing, self-supervised learning (SSL) and masked image …

The (R)Evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

From Instructions to Intrinsic Human Values--A Survey of Alignment Goals for Big Models

J Yao, X Yi, X Wang, J Wang, X Xie - arXiv preprint arXiv:2308.12014, 2023 - arxiv.org
Big models, exemplified by Large Language Models (LLMs), are models typically pre-
trained on massive data and composed of enormous numbers of parameters, which not only obtain …

InstructionGPT-4: A 200-instruction paradigm for fine-tuning MiniGPT-4

L Wei, Z Jiang, W Huang, L Sun - arXiv preprint arXiv:2308.12067, 2023 - arxiv.org
Multimodal large language models acquire their instruction-following capabilities through a
two-stage training process: pre-training on image-text pairs and fine-tuning on supervised …

Vision-language instruction tuning: A review and analysis

C Li, Y Ge, D Li, Y Shan - Transactions on Machine Learning …, 2023 - openreview.net
Instruction tuning is a crucial supervised training phase in Large Language Models (LLMs),
aiming to enhance the LLM's ability to generalize instruction execution and adapt to user …

Emu: Generative pretraining in multimodality

Q Sun, Q Yu, Y Cui, F Zhang, X Zhang… - The Twelfth …, 2023 - openreview.net
We present Emu, a multimodal foundation model that seamlessly generates images and text
in multimodal context. This omnivore model can take in any single-modality or multimodal …

Sparkles: Unlocking chats across multiple images for multimodal instruction-following models

Y Huang, Z Meng, F Liu, Y Su, N Collier… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models exhibit enhanced zero-shot performance on various tasks when fine-
tuned with instruction-following data. Multimodal instruction-following models extend these …