AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

FILIP: Fine-grained interactive language-image pre-training

L Yao, R Huang, L Hou, G Lu, M Niu, H Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
Unsupervised large-scale vision-language pre-training has shown promising advances on
various downstream tasks. Existing methods often model the cross-modal interaction either …

Learning customized visual models with retrieval-augmented knowledge

H Liu, K Son, J Yang, C Liu, J Gao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text contrastive learning models such as CLIP have demonstrated strong task transfer
ability. The high generality and usability of these visual models is achieved via a web-scale …
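
For reference on this entry only (not the paper's own method), a minimal PyTorch sketch of the symmetric image-text contrastive objective popularized by CLIP; the tensor names and temperature value are illustrative assumptions:

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) embeddings of paired images and texts.
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2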

Reproducible scaling laws for contrastive language-image learning

M Cherti, R Beaumont, R Wightman… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scaling up neural networks has led to remarkable performance across a wide range of
tasks. Moreover, performance often follows reliable scaling laws as a function of training set …
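
As a generic illustration of fitting a scaling law (not the paper's reported fit), a short sketch that fits a saturating power law, error(N) = a * N**(-b) + c, to hypothetical (training-set size, zero-shot error) measurements:

import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Saturating power law: error decays with scale and levels off at c.
    return a * np.power(n, -b) + c

# Hypothetical measurements: training-set sizes and zero-shot error rates.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
errors = np.array([0.55, 0.42, 0.33, 0.27])

params, _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.1, 0.1), maxfev=10000)
a, b, c = params
print(f"fitted exponent b = {b:.3f}, irreducible error c = {c:.3f}")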

Position-guided text prompt for vision-language pre-training

J Wang, P Zhou, MZ Shou… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision-Language Pre-Training (VLP) has shown promising capabilities to align
image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we …

xGQA: Cross-lingual visual question answering

J Pfeiffer, G Geigle, A Kamath, JMO Steitz… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent advances in multimodal vision and language modeling have predominantly focused
on the English language, mostly due to the lack of multilingual multimodal datasets to steer …

Connecting the dots between audio and text without parallel data through visual knowledge transfer

Y Zhao, J Hessel, Y Yu, X Lu, R Zellers… - arXiv preprint arXiv …, 2021 - arxiv.org
Machines that can represent and describe environmental soundscapes have practical
potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have …

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-
form text-image composition and comprehension. This model goes beyond conventional …

VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching

J Bi, D Cheng, P Yao, B Pang, Y Zhan… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-Language Pretraining (VLP) has significantly improved the performance of
various vision-language tasks with the matching of images and texts. In this paper, we …

Is BERT blind? Exploring the effect of vision-and-language pretraining on visual language understanding

M Alper, M Fiman… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Most humans use visual imagination to understand and reason about language, but models
such as BERT reason about language using knowledge acquired during text-only …