Cross-lingual cross-modal pretraining for multimodal retrieval

H Fei, T Yu, P Li - Proceedings of the 2021 Conference of the …, 2021 - aclanthology.org
Recent pretrained vision-language models have achieved impressive performance on cross-
modal retrieval tasks in English. Their success, however, heavily depends on the availability …

UFO: A unified transformer for vision-language representation learning

J Wang, X Hu, Z Gan, Z Yang, X Dai, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of
processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …

Understanding transferable representation learning and zero-shot transfer in CLIP

Z Chen, Y Deng, Y Li, Q Gu - arXiv preprint arXiv:2310.00927, 2023 - arxiv.org
Multi-modal learning has become increasingly popular due to its ability to leverage
information from different data sources (e.g., text and images) to improve the model …

Decoupling the role of data, attention, and losses in multimodal transformers

LA Hendricks, J Mellor, R Schneider… - Transactions of the …, 2021 - direct.mit.edu
Recently, multimodal transformer models have gained popularity because their performance
on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on …

Improving zero-shot generalization and robustness of multi-modal models

Y Ge, J Ren, A Gallagher, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive
performance on image classification benchmarks and their zero-shot generalization ability is …

MiniGPT-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

Teaching CLIP to count to ten

R Paiss, A Ephrat, O Tov, S Zada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large vision-language models, such as CLIP, learn robust representations of text and
images, facilitating advances in many downstream tasks, including zero-shot classification …

ALIP: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …

PaLI: A jointly-scaled multilingual language-image model

X Chen, X Wang, S Changpinyo… - arXiv preprint arXiv …, 2022 - arxiv.org
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …

RegionCLIP: Region-based language-image pretraining

Y Zhong, J Yang, P Zhang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved
impressive results on image classification in both zero-shot and transfer learning settings …