A corpus for reasoning about natural language grounded in photographs

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by: 142 · Related articles · All 7 versions

[PDF] arxiv.org

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org

Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Cited by: 2206 · Related articles · All 8 versions

[PDF] thecvf.com

Image as a foreign language: BEiT pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023 - openaccess.thecvf.com

A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

Cited by: 360 · Related articles · All 5 versions

[PDF] thecvf.com

Visual programming: Compositional visual reasoning without training

T Gupta, A Kembhavi - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com

We present VISPROG, a neuro-symbolic approach to solving complex and compositional
visual tasks given natural language instructions. VISPROG avoids the need for any task …

Cited by: 250 · Related articles · All 7 versions

[PDF] arxiv.org

CoCa: Contrastive captioners are image-text foundation models

J Yu, Z Wang, V Vasudevan, L Yeung… - arXiv preprint arXiv …, 2022 - arxiv.org

Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …

Cited by: 1036 · Related articles · All 7 versions

[PDF] arxiv.org

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …

Cited by: 143 · Related articles · All 9 versions

[PDF] mlr.press

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

J Li, D Li, C Xiong, S Hoi - International conference on …, 2022 - proceedings.mlr.press

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language
tasks. However, most existing pre-trained models only excel in either …

Cited by: 2510 · Related articles · All 5 versions

[PDF] openreview.net

When and why vision-language models behave like bags-of-words, and what to do about it?

M Yuksekgonul, F Bianchi, P Kalluri… - The Eleventh …, 2023 - openreview.net

Despite the success of large vision and language models (VLMs) in many downstream
applications, it is unclear how well they encode the compositional relationships between …

Cited by: 192 · Related articles · All 3 versions

[PDF] neurips.cc

LST: Ladder side-tuning for parameter and memory efficient transfer learning

YL Sung, J Cho, M Bansal - Advances in Neural …, 2022 - proceedings.neurips.cc

Fine-tuning large pre-trained models on downstream tasks has been adopted in a variety of
domains recently. However, it is costly to update the entire parameter set of large pre-trained …

Cited by: 149 · Related articles · All 5 versions

[PDF] thecvf.com

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com

Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …

Cited by: 226 · Related articles · All 8 versions
