CoCa: Contrastive captioners are image-text foundation models. arXiv 2022

G Ilharco, M Wortsman, SY Gadre… - Advances in …, 2022 - proceedings.neurips.cc

Open-vocabulary models like CLIP achieve high accuracy across many image classification
tasks. However, there are still settings where their zero-shot performance is far from optimal …

Cited by 106 · Related articles · All 6 versions

A review of transformer-based approaches for image captioning

O Ondeng, H Ouma, P Akuon - Applied Sciences, 2023 - mdpi.com

Visual understanding is a research area that bridges the gap between computer vision and
natural language processing. Image captioning is a visual understanding task in which …

Cited by 3 · Related articles · All 3 versions

Spotlight: Mobile UI understanding using vision-language models with a focus

G Li, Y Li - arXiv preprint arXiv:2209.14927, 2022 - arxiv.org

Mobile UI understanding is important for enabling various interaction tasks such as UI
automation and accessibility. Previous mobile UI modeling often depends on the view …

Cited by 31 · Related articles · All 5 versions

Test-time distribution normalization for contrastively learned visual-language models

Y Zhou, J Ren, F Li, R Zabih… - Advances in Neural …, 2024 - proceedings.neurips.cc

Advances in the field of visual-language contrastive learning have made it possible for many
downstream applications to be carried out efficiently and accurately by simply taking the dot …

Cited by 10 · Related articles · All 7 versions

Scene-centric vs. object-centric image-text cross-modal retrieval: a reproducibility study

M Hendriksen, S Vakulenko, E Kuiper… - European Conference on …, 2023 - Springer

Most approaches to cross-modal retrieval (CMR) focus either on object-centric datasets, meaning that each
document depicts or describes a single object, or on scene-centric datasets, meaning that …

Cited by 15 · Related articles · All 6 versions

Towards grounded visual spatial reasoning in multi-modal vision language models

N Rajabi, J Kosecka - arXiv preprint arXiv:2308.09778, 2023 - arxiv.org

With the advances in large scale vision-and-language models (VLMs) it is of interest to
assess their performance on various visual reasoning tasks such as counting, referring …

Cited by 6 · Related articles · All 3 versions

Language model crossover: Variation through few-shot prompting

E Meyerson, MJ Nelson, H Bradley, A Gaier… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper pursues the insight that language models naturally enable an intelligent variation
operator similar in spirit to evolutionary crossover. In particular, language models of …

Cited by 39 · Related articles · All 3 versions

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing

X Zhang, W Li, X Wang, L Wang, F Zheng, L Wang… - Remote Sensing, 2023 - mdpi.com

In recent years, there has been a growing interest in remote sensing image–text cross-modal
retrieval due to the rapid development of space information technology and the …

Cited by 3 · Related articles · All 5 versions

VLAP: Efficient video-language alignment via frame prompting and distilling for video question answering

X Wang, J Liang, CK Wang, K Deng, Y Lou… - arXiv preprint arXiv …, 2023 - arxiv.org

In this work, we propose an efficient Video-Language Alignment via Frame-Prompting and
Distilling (VLAP) network. Our VLAP model addresses both efficient frame sampling and …

Cited by 2 · Related articles · All 2 versions

Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation

H Kerdegari, K Higgins, D Veselkov, I Laponogov… - Diagnostics, 2024 - mdpi.com

The integration of artificial intelligence (AI) in medical diagnostics represents a significant
advancement in managing upper gastrointestinal (GI) cancer, which is a major cause of …
