CLIP in medical imaging: A comprehensive survey

Z Zhao, Y Liu, H Wu, M Wang, Y Li, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training
paradigm, successfully introduces text supervision to vision models. It has shown promising …

A comprehensive survey of foundation models in medicine

W Khan, S Leem, KB See, JK Wong… - IEEE Reviews in …, 2025 - ieeexplore.ieee.org
Foundation models (FMs) are large-scale deep learning models that are developed using
large datasets and self-supervised learning methods. These models serve as a base for …

Improving medical multi-modal contrastive learning with expert annotations

Y Kumar, P Marttinen - European Conference on Computer Vision, 2024 - Springer
We introduce eCLIP, an enhanced version of the CLIP model that integrates expert
annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in …

FairCLIP: Harnessing fairness in vision-language learning

Y Luo, M Shi, MO Khan, MM Afzal… - Proceedings of the …, 2024 - openaccess.thecvf.com
Fairness is a critical concern in deep learning, especially in healthcare, where these models
influence diagnoses and treatment decisions. Although fairness has been investigated in the …

Fact-aware multimodal retrieval augmentation for accurate medical radiology report generation

L Sun, J Zhao, M Han, C Xiong - arXiv preprint arXiv:2407.15268, 2024 - arxiv.org
Multimodal foundation models hold significant potential for automating radiology report
generation, thereby assisting clinicians in diagnosing cardiac diseases. However, generated …

Core-Periphery Multi-Modality Feature Alignment for Zero-Shot Medical Image Analysis

X Yu, L Zhang, Z Wu, D Zhu - IEEE Transactions on Medical …, 2024 - ieeexplore.ieee.org
Multi-modality learning, exemplified by the language-image pair pre-trained CLIP model,
has demonstrated remarkable performance in enhancing zero-shot capabilities and has …

Medical vision language pretraining: A survey

P Shrestha, S Amgain, B Khanal, CA Linte… - arXiv preprint arXiv …, 2023 - arxiv.org
Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to
the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and …

Automatic Medical Report Generation: Methods and Applications

L Guo, AM Tahir, D Zhang, ZJ Wang… - … Transactions on Signal …, 2024 - nowpublishers.com
The increasing demand for medical imaging has surpassed the capacity of available
radiologists, leading to diagnostic delays and potential misdiagnoses. Artificial intelligence …

Foundation model for advancing healthcare: Challenges, opportunities, and future directions

Y He, F Huang, X Jiang, Y Nie, M Wang, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models, which are pre-trained on broad data and can adapt to a wide range
of tasks, are advancing healthcare. They promote the development of healthcare artificial …

PairAug: What Can Augmented Image-Text Pairs Do for Radiology?

Y Xie, Q Chen, S Wang, MS To, I Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current vision-language pre-training (VLP) methodologies predominantly depend on paired
image-text datasets, a resource that is challenging to acquire in radiology due to privacy …