- Academic resource search

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Cited by 125 · Related articles · All 2 versions

[PDF] thecvf.com

What if the tv was off? examining counterfactual reasoning abilities of multi-modal language models

L Zhang, X Zhai, Z Zhao, Y Zong… - Proceedings of the …, 2024 - openaccess.thecvf.com

Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating
alternatives to established facts or past events, significantly enhancing our abilities in …

Cited by 16 · Related articles · All 7 versions

[PDF] arxiv.org

Omnisat: Self-supervised modality fusion for earth observation

G Astruc, N Gonthier, C Mallet, L Landrieu - European Conference on …, 2025 - Springer

The diversity and complementarity of sensors available for Earth Observations (EO) calls for
developing bespoke self-supervised multimodal learning approaches. However, current …

Cited by 10 · Related articles · All 12 versions

[PDF] arxiv.org

Fool your (vision and) language model with embarrassingly simple permutations

Y Zong, T Yu, R Chavhan, B Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org

Large language and vision-language models are rapidly being deployed in practice thanks
to their impressive capabilities in instruction following, in-context learning, and so on. This …

Cited by 20 · Related articles · All 4 versions

[PDF] arxiv.org

Safety fine-tuning at (almost) no cost: A baseline for vision large language models

Y Zong, O Bohdal, T Yu, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone
to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. Our …

Cited by 48 · Related articles · All 4 versions

[PDF] thecvf.com

Can self-supervised representation learning methods withstand distribution shifts and corruptions?

PC Chhipa, JR Holmgren, K De… - Proceedings of the …, 2023 - openaccess.thecvf.com

Self-supervised representation learning (SSL) in computer vision aims to leverage the
inherent structure and relationships within data to learn meaningful representations without …

Cited by 6 · Related articles · All 6 versions

[PDF] arxiv.org

Snip: Bridging mathematical symbolic and numeric realms with unified pre-training

K Meidani, P Shojaee, CK Reddy… - arXiv preprint arXiv …, 2023 - arxiv.org

In an era where symbolic mathematical equations are indispensable for modeling complex
natural phenomena, scientific inquiry often involves collecting observations and translating …

Cited by 15 · Related articles · All 5 versions

[PDF] arxiv.org

What makes for good morphology representations for spatial omics?

E Chelebian, C Avenel, C Wählby - arXiv preprint arXiv:2407.20660, 2024 - arxiv.org

Spatial omics has transformed our understanding of tissue architecture by preserving spatial
context of gene expression patterns. Simultaneously, advances in imaging AI have enabled …

Cited by 1 · Related articles · All 3 versions

[PDF] acm.org

Prompt me up: Unleashing the power of alignments for multimodal entity and relation extraction

X Hu, J Chen, A Liu, S Meng, L Wen… - Proceedings of the 31st …, 2023 - dl.acm.org

How can we better extract entities and relations from text? Using multimodal extraction with
images and text obtains more signals for entities and relations, and aligns them through …

Cited by 14 · Related articles · All 3 versions

[PDF] arxiv.org

Medical vision language pretraining: A survey

P Shrestha, S Amgain, B Khanal, CA Linte… - arXiv preprint arXiv …, 2023 - arxiv.org

Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to
the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and …

Cited by 14 · Related articles · All 3 versions
