Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper, we present the first systematic study to investigate the design …

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …

Aligning large multimodal models with factually augmented rlhf

Z Sun, S Shen, S Cao, H Liu, C Li, Y Shen… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Multimodal Models (LMM) are built across modalities, and the misalignment between
two modalities can result in "hallucination", generating textual outputs that are not grounded …

Honeybee: Locality-enhanced projector for multimodal llm

J Cha, W Kang, J Mun, B Roh - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial
role in bridging pre-trained vision encoders with LLMs, enabling profound visual …

Emu: Generative pretraining in multimodality

Q Sun, Q Yu, Y Cui, F Zhang, X Zhang… - The Twelfth …, 2023 - openreview.net
We present Emu, a multimodal foundation model that seamlessly generates images and text
in multimodal context. This omnivore model can take in any single-modality or multimodal …

Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

Y Zhang, Y Pan, T Zhong, P Dong, K Xie, Y Liu, H Jiang… - Meta-Radiology, 2024 - Elsevier
Medical images and radiology reports are essential for physicians to diagnose medical
conditions. However, the vast diversity and cross-source heterogeneity inherent in these …

Query performance prediction using relevance judgments generated by large language models

C Meng, N Arabzadeh, A Askari, M Aliannejadi… - arXiv preprint arXiv …, 2024 - arxiv.org
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system
for a query without human relevance judgments. Previous QPP methods typically return a …

Task Navigator: Decomposing Complex Tasks for Multimodal Large Language Models

F Ma, Y Zhou, Y Zhang, S Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Inspired by the remarkable progress achieved by recent Large Language Models (LLMs),
Multimodal Large Language Models (MLLMs) take LLMs as their brains and have achieved …

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

W Zhang, T Lin, J Liu, F Shu, H Li, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements indicate that scaling up Multimodal Large Language Models
(MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing …