COOKIE: Contrastive cross-modal knowledge sharing pre-training for vision-language representation

K Wen, J Xia, Y Huang, L Li, J Xu… - Proceedings of the …, 2021 - openaccess.thecvf.com
There has been a recent surge of interest in cross-modal pre-training. However, existing
approaches pre-train a one-stream model to learn joint vision-language representation …

SemVLP: Vision-language pre-training by aligning semantics at multiple levels

C Li, M Yan, H Xu, F Luo, W Wang, B Bi… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed
rapid progress for learning cross-modal representations. Existing pre-training methods …

EfficientCLIP: Efficient cross-modal pre-training by ensemble confident learning and language modeling

J Wang, H Wang, J Deng, W Wu, D Zhang - arXiv preprint arXiv …, 2021 - arxiv.org
While large-scale pre-training has achieved great success in bridging the gap
between vision and language, it still faces several challenges. First, the cost for pre-training …

UC2: Universal cross-lingual cross-modal vision-and-language pre-training

M Zhou, L Zhou, S Wang, Y Cheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language pre-training has achieved impressive success in learning multimodal
representations between vision and language. To generalize this success to non-English …

Unsupervised vision-and-language pre-training via retrieval-based multi-granular alignment

M Zhou, L Yu, A Singh, M Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-and-Language (V+L) pre-training models have achieved tremendous
success in recent years on various multi-modal benchmarks. However, the majority of …

Structure-CLIP: Enhance multi-modal language representations with structure knowledge

Y Huang, J Tang, Z Chen, R Zhang… - arXiv preprint arXiv …, 2023 - researchgate.net
Large-scale vision-language pre-training has shown promising advances on various
downstream tasks and achieved significant performance in multi-modal understanding and …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., the InfoNCE loss). The success of this alignment strategy is attributed to …
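
For readers unfamiliar with the InfoNCE loss named in the snippet above, the following is a minimal illustrative sketch, not code from any of the listed papers; the function name, tensor shapes, and the temperature value of 0.07 are assumptions chosen for demonstration:

```python
# Illustrative sketch of a symmetric InfoNCE loss over a batch of matched
# image-text embedding pairs (hypothetical helper, not from the cited works).
import torch
import torch.nn.functional as F

def info_nce(image_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of matched pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds positive pairs,
    # all off-diagonal entries act as in-batch negatives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions (image-to-text and text-to-image), then average.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```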

COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval

H Lu, N Fei, Y Huo, Y Gao, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Large-scale single-stream pre-training has shown impressive performance in image-text
retrieval. Regrettably, it suffers from low inference efficiency due to heavy attention layers …

ERNIE-ViL 2.0: Multi-view contrastive learning for image-text pre-training

B Shan, W Yin, Y Sun, H Tian, H Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted
extensive attention from academia and industry due to their superior performance on various …

12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …