Patching open-vocabulary models by interpolating weights

G Ilharco, M Wortsman, SY Gadre… - Advances in …, 2022 - proceedings.neurips.cc
Open-vocabulary models like CLIP achieve high accuracy across many image classification
tasks. However, there are still settings where their zero-shot performance is far from optimal …

A review of transformer-based approaches for image captioning

O Ondeng, H Ouma, P Akuon - Applied Sciences, 2023 - mdpi.com
Visual understanding is a research area that bridges the gap between computer vision and
natural language processing. Image captioning is a visual understanding task in which …

Spotlight: Mobile UI understanding using vision-language models with a focus

G Li, Y Li - arXiv preprint arXiv:2209.14927, 2022 - arxiv.org
Mobile UI understanding is important for enabling various interaction tasks such as UI
automation and accessibility. Previous mobile UI modeling often depends on the view …

Test-time distribution normalization for contrastively learned visual-language models

Y Zhou, J Ren, F Li, R Zabih… - Advances in Neural …, 2024 - proceedings.neurips.cc
Advances in the field of visual-language contrastive learning have made it possible for many
downstream applications to be carried out efficiently and accurately by simply taking the dot …

Scene-centric vs. object-centric image-text cross-modal retrieval: a reproducibility study

M Hendriksen, S Vakulenko, E Kuiper… - European Conference on …, 2023 - Springer
Most approaches to cross-modal retrieval (CMR) focus either on object-centric datasets, meaning that each
document depicts or describes a single object, or on scene-centric datasets, meaning that …

Towards grounded visual spatial reasoning in multi-modal vision language models

N Rajabi, J Kosecka - arXiv preprint arXiv:2308.09778, 2023 - arxiv.org
With the advances in large scale vision-and-language models (VLMs) it is of interest to
assess their performance on various visual reasoning tasks such as counting, referring …

Language model crossover: Variation through few-shot prompting

E Meyerson, MJ Nelson, H Bradley, A Gaier… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper pursues the insight that language models naturally enable an intelligent variation
operator similar in spirit to evolutionary crossover. In particular, language models of …

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing

X Zhang, W Li, X Wang, L Wang, F Zheng, L Wang… - Remote Sensing, 2023 - mdpi.com
In recent years, there has been a growing interest in remote sensing image–text
cross-modal retrieval due to the rapid development of space information technology and the …

VLAP: Efficient video-language alignment via frame prompting and distilling for video question answering

X Wang, J Liang, CK Wang, K Deng, Y Lou… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we propose an efficient Video-Language Alignment via Frame-Prompting and
Distilling (VLAP) network. Our VLAP model addresses both efficient frame sampling and …

Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation

H Kerdegari, K Higgins, D Veselkov, I Laponogov… - Diagnostics, 2024 - mdpi.com
The integration of artificial intelligence (AI) in medical diagnostics represents a significant
advancement in managing upper gastrointestinal (GI) cancer, which is a major cause of …