AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

FILIP: Fine-grained interactive language-image pre-training

L Yao, R Huang, L Hou, G Lu, M Niu, H Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
Unsupervised large-scale vision-language pre-training has shown promising advances on
various downstream tasks. Existing methods often model the cross-modal interaction either …

Learning customized visual models with retrieval-augmented knowledge

H Liu, K Son, J Yang, C Liu, J Gao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text contrastive learning models such as CLIP have demonstrated strong task transfer
ability. The high generality and usability of these visual models is achieved via a web-scale …
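
For reference on this entry only (not the paper's own method), a minimal PyTorch sketch of the symmetric image-text contrastive objective popularized by CLIP; the tensor names and temperature value are illustrative assumptions:

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) embeddings of paired images and texts.
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2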

Reproducible scaling laws for contrastive language-image learning

M Cherti, R Beaumont, R Wightman… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scaling up neural networks has led to remarkable performance across a wide range of
tasks. Moreover, performance often follows reliable scaling laws as a function of training set …
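
As a generic illustration of fitting a scaling law (not the paper's reported fit), a short sketch that fits a saturating power law, error(N) = a * N**(-b) + c, to hypothetical (training-set size, zero-shot error) measurements:

import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Saturating power law: error decays with scale and levels off at c.
    return a * np.power(n, -b) + c

# Hypothetical measurements: training-set sizes and zero-shot error rates.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
errors = np.array([0.55, 0.42, 0.33, 0.27])

params, _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.1, 0.1), maxfev=10000)
a, b, c = params
print(f"fitted exponent b = {b:.3f}, irreducible error c = {c:.3f}")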

Position-guided text prompt for vision-language pre-training

J Wang, P Zhou, MZ Shou… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision-Language Pre-Training (VLP) has shown promising capabilities to align
image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we …

xGQA: Cross-lingual visual question answering

J Pfeiffer, G Geigle, A Kamath, JMO Steitz… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent advances in multimodal vision and language modeling have predominantly focused
on the English language, mostly due to the lack of multilingual multimodal datasets to steer …

Connecting the dots between audio and text without parallel data through visual knowledge transfer

Y Zhao, J Hessel, Y Yu, X Lu, R Zellers… - arXiv preprint arXiv …, 2021 - arxiv.org
Machines that can represent and describe environmental soundscapes have practical
potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have …

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-
form text-image composition and comprehension. This model goes beyond conventional …

VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching

J Bi, D Cheng, P Yao, B Pang, Y Zhan… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-Language Pretraining (VLP) has significantly improved the performance of
various vision-language tasks with the matching of images and texts. In this paper, we …

Is BERT blind? Exploring the effect of vision-and-language pretraining on visual language understanding

M Alper, M Fiman… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Most humans use visual imagination to understand and reason about language, but models
such as BERT reason about language using knowledge acquired during text-only …