UReader: Universal OCR-free visually-situated language understanding with multimodal large language model

J Ye, A Hu, H Xu, Q Ye, M Yan, G Xu, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Text is ubiquitous in our visual world, conveying crucial information, such as in documents,
websites, and everyday photographs. In this work, we propose UReader, a first exploration …

CPT: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …

Enabling multimodal generation on CLIP via vision-language knowledge distillation

W Dai, L Hou, L Shang, X Jiang, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
The recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g.,
CLIP) with a tremendous amount of image-text pair data has shown its superiority on …

Zero-shot image-to-text generation for visual-semantic arithmetic

Y Tewel, Y Shalev, I Schwartz, L Wolf - arXiv preprint arXiv …, 2021 - academia.edu
Recent text-to-image matching models apply contrastive learning to large corpora of
uncurated pairs of images and sentences. While such models can provide a powerful score …

Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

C Yi, L Ren, DC Zhan, HJ Ye - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
CLIP showcases exceptional cross-modal matching capabilities due to its training on image-
text contrastive learning tasks. However, without specific optimization for unimodal scenarios …

Alpha-CLIP: A CLIP model focusing on wherever you want

Z Sun, Y Fang, T Wu, P Zhang, Y Zang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) plays an essential role in
extracting valuable content information from images across diverse tasks. It aligns textual …

Contrastive language-image pre-training for the Italian language

F Bianchi, G Attanasio, R Pisoni, S Terragni… - arXiv preprint arXiv …, 2021 - arxiv.org
CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that
jointly learns representations of images and texts. The model is trained on a massive …

Plug-and-play grounding of reasoning in multimodal large language models

J Chen, Y Liu, D Li, X An, Z Feng, Y Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
The surge of Multimodal Large Language Models (MLLMs), given their prominent emergent
capabilities in instruction following and reasoning, has greatly advanced the field of visual …

LiT: Zero-shot transfer with locked-image text tuning

X Zhai, X Wang, B Mustafa, A Steiner… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents contrastive-tuning, a simple method employing contrastive training to
align image and text models while still taking advantage of their pre-training. In our empirical …

Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm

Y Li, F Liang, L Zhao, Y Cui, W Ouyang, J Shao… - arXiv preprint arXiv …, 2021 - arxiv.org
Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted
unprecedented attention for its impressive zero-shot recognition ability and excellent …
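Most of the entries above build on the same CLIP-style objective: a symmetric contrastive loss that pulls each image embedding toward its paired text embedding and pushes it away from the other texts in the batch. A minimal NumPy sketch of that objective (a generic illustration with assumed names like `clip_contrastive_loss`, not the exact implementation of any paper listed here):

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is cosine similarity
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by temperature; entry (i, j) scores
    # image i against text j, and the diagonal holds the true pairs
    logits = image_embs @ text_embs.T / temperature
    n = logits.shape[0]

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

With correctly paired embeddings the loss is near zero; shuffling the text side so pairs no longer match drives it up, which is the signal the contrastive training in these papers exploits.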