MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Exploring the frontier of vision-language models: A survey of current methodologies and future directions

A Ghosh, A Acharya, S Saha, V Jain… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of
the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily …

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain

W Zhang, M Cai, T Zhang, Y Zhuang, X Mao - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have demonstrated remarkable success in
vision and visual-language tasks within the natural image domain. Owing to the significant …

Generative Visual Instruction Tuning

J Hernandez, R Villegas, V Ordonez - arXiv preprint arXiv:2406.11262, 2024 - arxiv.org
We propose to use machine-generated instruction-following data to improve the zero-shot
capabilities of a large multimodal model with additional support for generative and image …

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Z Kong, A Goel, R Badlani, W Ping, R Valle… - arXiv preprint arXiv …, 2024 - arxiv.org
Augmenting large language models (LLMs) to understand audio--including non-speech
sounds and non-verbal speech--is critically important for diverse real-world applications of …

From Large Language Models to Large Multimodal Models: A Literature Review

D Huang, C Yan, Q Li, X Peng - Applied Sciences, 2024 - mdpi.com
With the deepening of research on Large Language Models (LLMs), significant progress has
been made in recent years on the development of Large Multimodal Models (LMMs), which …

C3LLM: Conditional Multimodal Content Generation Using Large Language Models

Z Wang, Q Duan, YW Tai, CK Tang - arXiv preprint arXiv:2405.16136, 2024 - arxiv.org
We introduce C3LLM (Conditioned-on-Three-Modalities Large Language Models), a novel
framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together …

The Evolution of Multimodal Model Architectures

SN Wadekar, A Chaurasia, A Chadha… - arXiv preprint arXiv …, 2024 - arxiv.org
This work uniquely identifies and characterizes four prevalent multimodal model
architectural patterns in the contemporary multimodal landscape. Systematically …

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

S Yang, Z Zhong, M Zhao, S Takahashi, M Ishii… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, with the realistic generation results and a wide range of personalized
applications, diffusion-based generative models gain huge attention in both visual and …