Foundation Models Defining a New Era in Vision: A Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

What if the TV was off? Examining counterfactual reasoning abilities of multi-modal language models

L Zhang, X Zhai, Z Zhao, Y Zong… - Proceedings of the …, 2024 - openaccess.thecvf.com
Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating
alternatives to established facts or past events, significantly enhancing our abilities in …

OmniSat: Self-supervised modality fusion for Earth observation

G Astruc, N Gonthier, C Mallet, L Landrieu - European Conference on …, 2025 - Springer
The diversity and complementarity of sensors available for Earth Observations (EO) call for
developing bespoke self-supervised multimodal learning approaches. However, current …

Fool your (vision and) language model with embarrassingly simple permutations

Y Zong, T Yu, R Chavhan, B Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language and vision-language models are rapidly being deployed in practice thanks
to their impressive capabilities in instruction following, in-context learning, and so on. This …

Safety fine-tuning at (almost) no cost: A baseline for vision large language models

Y Zong, O Bohdal, T Yu, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Current vision large language models (VLLMs) exhibit remarkable capabilities, yet are prone
to generating harmful content and are vulnerable to even the simplest jailbreaking attacks. Our …

Can self-supervised representation learning methods withstand distribution shifts and corruptions?

PC Chhipa, JR Holmgren, K De… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervised representation learning (SSL) in computer vision aims to leverage the
inherent structure and relationships within data to learn meaningful representations without …

SNIP: Bridging mathematical symbolic and numeric realms with unified pre-training

K Meidani, P Shojaee, CK Reddy… - arXiv preprint arXiv …, 2023 - arxiv.org
In an era where symbolic mathematical equations are indispensable for modeling complex
natural phenomena, scientific inquiry often involves collecting observations and translating …

What makes for good morphology representations for spatial omics?

E Chelebian, C Avenel, C Wählby - arXiv preprint arXiv:2407.20660, 2024 - arxiv.org
Spatial omics has transformed our understanding of tissue architecture by preserving the spatial
context of gene expression patterns. Simultaneously, advances in imaging AI have enabled …

Prompt me up: Unleashing the power of alignments for multimodal entity and relation extraction

X Hu, J Chen, A Liu, S Meng, L Wen… - Proceedings of the 31st …, 2023 - dl.acm.org
How can we better extract entities and relations from text? Using multimodal extraction with
images and text obtains more signals for entities and relations, and aligns them through …

Medical vision language pretraining: A survey

P Shrestha, S Amgain, B Khanal, CA Linte… - arXiv preprint arXiv …, 2023 - arxiv.org
Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to
the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and …