UReader: Universal OCR-free visually-situated language understanding with multimodal large language model

J Ye, A Hu, H Xu, Q Ye, M Yan, G Xu, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Text is ubiquitous in our visual world, conveying crucial information, such as in documents,
websites, and everyday photographs. In this work, we propose UReader, a first exploration …

CPT: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …

Enabling multimodal generation on CLIP via vision-language knowledge distillation

W Dai, L Hou, L Shang, X Jiang, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
The recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g.,
CLIP) with a tremendous amount of image-text pair data has shown its superiority on …

Zero-shot image-to-text generation for visual-semantic arithmetic

Y Tewel, Y Shalev, I Schwartz, L Wolf - arXiv preprint arXiv …, 2021 - academia.edu
Recent text-to-image matching models apply contrastive learning to large corpora of
uncurated pairs of images and sentences. While such models can provide a powerful score …

Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

C Yi, L Ren, DC Zhan, HJ Ye - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
CLIP showcases exceptional cross-modal matching capabilities due to its training on image-
text contrastive learning tasks. However, without specific optimization for unimodal scenarios …

Alpha-CLIP: A CLIP model focusing on wherever you want

Z Sun, Y Fang, T Wu, P Zhang, Y Zang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) plays an essential role in
extracting valuable content information from images across diverse tasks. It aligns textual …

Contrastive language-image pre-training for the Italian language

F Bianchi, G Attanasio, R Pisoni, S Terragni… - arXiv preprint arXiv …, 2021 - arxiv.org
CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that
jointly learns representations of images and texts. The model is trained on a massive …

Plug-and-play grounding of reasoning in multimodal large language models

J Chen, Y Liu, D Li, X An, Z Feng, Y Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
The surge of Multimodal Large Language Models (MLLMs), given their prominent emergent
capabilities in instruction following and reasoning, has greatly advanced the field of visual …

LiT: Zero-shot transfer with locked-image text tuning

X Zhai, X Wang, B Mustafa, A Steiner… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents contrastive-tuning, a simple method employing contrastive training to
align image and text models while still taking advantage of their pre-training. In our empirical …

Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm

Y Li, F Liang, L Zhao, Y Cui, W Ouyang, J Shao… - arXiv preprint arXiv …, 2021 - arxiv.org
Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted
unprecedented attention for its impressive zero-shot recognition ability and excellent …
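Most of the entries above build on the same CLIP-style objective: a symmetric contrastive loss that pulls each image embedding toward its paired text embedding and pushes it away from the other texts in the batch. A minimal NumPy sketch of that objective (a generic illustration with assumed names like `clip_contrastive_loss`, not the exact implementation of any paper listed here):

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is cosine similarity
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by temperature; entry (i, j) scores
    # image i against text j, and the diagonal holds the true pairs
    logits = image_embs @ text_embs.T / temperature
    n = logits.shape[0]

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

With correctly paired embeddings the loss is near zero; shuffling the text side so pairs no longer match drives it up, which is the signal the contrastive training in these papers exploits.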