Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding. In this …
K Li, Y Zhang, K Li, Y Li, Y Fu - IEEE transactions on pattern …, 2022 - ieeexplore.ieee.org
As a bridge between the language and vision domains, cross-modal retrieval between images and texts has been a hot research topic in recent years. It remains challenging because the current …
Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed the image and text to a shared …
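The shared-embedding idea behind these retrieval systems can be sketched in a few lines: map both modalities into one vector space, then rank by cosine similarity. The sketch below uses random vectors as stand-ins for learned image and text encoders (the encoders, dimensions, and names here are illustrative assumptions, not any specific paper's model).

```python
import numpy as np

# Toy sketch of cross-modal retrieval in a shared embedding space.
# Real systems learn the two encoders jointly, e.g. with a ranking
# or contrastive loss; here random features stand in for them.
rng = np.random.default_rng(0)

def l2_normalize(x):
    # On the unit sphere, the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical outputs of an image encoder (5 images) and a text encoder.
image_embeddings = l2_normalize(rng.standard_normal((5, 64)))
text_query = l2_normalize(rng.standard_normal(64))

# Rank all images by cosine similarity to the text query.
similarities = image_embeddings @ text_query
ranking = np.argsort(-similarities)   # best match first
best_image = ranking[0]
```

Text-to-image retrieval then returns the images in `ranking` order; image-to-text retrieval is the same computation with the roles swapped.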
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state …
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained …
Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, fatally suffer from slow …
K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs …
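The contrastive objective that CLIP-style pre-training uses can be sketched as a symmetric cross-entropy over a batch of matched image-text pairs: each image should score highest against its own caption and vice versa. The sketch below assumes random features in place of the learned encoders and a fixed temperature; it is a minimal illustration of the loss shape, not CLIP's actual implementation.

```python
import numpy as np

# Sketch of a CLIP-style symmetric contrastive loss over a batch of
# N matched image-text pairs (random features stand in for encoders).
rng = np.random.default_rng(0)
N, D = 4, 32

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img = l2_normalize(rng.standard_normal((N, D)))
txt = l2_normalize(rng.standard_normal((N, D)))
temperature = 0.07  # assumed fixed here; CLIP learns it

# N x N similarity matrix; diagonal entries are the matched pairs.
logits = (img @ txt.T) / temperature

def cross_entropy(logits, targets):
    # Numerically stable row-wise softmax cross-entropy.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.arange(N)  # pair i matches pair i
# Symmetric loss: average image-to-text and text-to-image directions.
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
```

Minimizing this loss pulls matched image-text embeddings together while pushing all in-batch mismatches apart, which is what makes web-scale image-text pairs such effective supervision.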