MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

On the challenges and perspectives of foundation models for medical image analysis

S Zhang, D Metaxas - Medical image analysis, 2024 - Elsevier
This article discusses the opportunities, applications and future directions of large-scale
pretrained models, i.e., foundation models, which promise to significantly improve the …

Adversarial diffusion distillation

A Sauer, D Lorenz, A Blattmann… - European Conference on …, 2025 - Springer
We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that
efficiently samples large-scale foundational image diffusion models in just 1–4 steps while …

A foundation model for clinical-grade computational pathology and rare cancers detection

E Vorontsov, A Bozkurt, A Casson, G Shaikovski… - Nature medicine, 2024 - nature.com
The analysis of histopathology images with artificial intelligence aims to enable clinical
decision support systems and precision medicine. The success of such applications …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Depth Anything: Unleashing the power of large-scale unlabeled data

L Yang, B Kang, Z Huang, X Xu… - Proceedings of the …, 2024 - openaccess.thecvf.com
This work presents Depth Anything, a highly practical solution for robust monocular
depth estimation. Without pursuing novel technical modules, we aim to build a simple yet …

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2025 - Springer
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Multimodal Large Language Models (MLLMs) have recently emerged as a rising research
hotspot, using powerful Large Language Models (LLMs) as a brain to perform …

End-to-end autonomous driving: Challenges and frontiers

L Chen, P Wu, K Chitta, B Jaeger… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The autonomous driving community has witnessed a rapid growth in approaches that
embrace an end-to-end algorithm framework, utilizing raw sensor input to generate vehicle …

AnyDoor: Zero-shot object-level image customization

X Chen, L Huang, Y Liu, Y Shen… - Proceedings of the …, 2024 - openaccess.thecvf.com
This work presents AnyDoor, a diffusion-based image generator with the power to teleport
target objects to new scenes at user-specified locations with desired shapes. Instead of …