COOKIE: Contrastive cross-modal knowledge sharing pre-training for vision-language representation

K Wen, J Xia, Y Huang, L Li, J Xu… - Proceedings of the …, 2021 - openaccess.thecvf.com
There has been a recent surge of interest in cross-modal pre-training. However, existing
approaches pre-train a one-stream model to learn joint vision-language representation …

SemVLP: Vision-language pre-training by aligning semantics at multiple levels

C Li, M Yan, H Xu, F Luo, W Wang, B Bi… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed
rapid progress for learning cross-modal representations. Existing pre-training methods …

EfficientCLIP: Efficient cross-modal pre-training by ensemble confident learning and language modeling

J Wang, H Wang, J Deng, W Wu, D Zhang - arXiv preprint arXiv …, 2021 - arxiv.org
While large-scale pre-training has achieved great success in bridging the gap
between vision and language, it still faces several challenges. First, the cost for pre-training …

UC2: Universal cross-lingual cross-modal vision-and-language pre-training

M Zhou, L Zhou, S Wang, Y Cheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language pre-training has achieved impressive success in learning multimodal
representations between vision and language. To generalize this success to non-English …

Unsupervised vision-and-language pre-training via retrieval-based multi-granular alignment

M Zhou, L Yu, A Singh, M Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-and-Language (V+L) pre-training models have achieved tremendous
success in recent years on various multi-modal benchmarks. However, the majority of …

Structure-CLIP: Enhance multi-modal language representations with structure knowledge

Y Huang, J Tang, Z Chen, R Zhang… - arXiv preprint arXiv …, 2023 - researchgate.net
Large-scale vision-language pre-training has shown promising advances on various
downstream tasks and achieved significant performance in multi-modal understanding and …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., the InfoNCE loss). The success of this alignment strategy is attributed to …
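
For readers unfamiliar with the InfoNCE loss named in the snippet above, the following is a minimal illustrative sketch, not code from any of the listed papers; the function name, tensor shapes, and the temperature value of 0.07 are assumptions chosen for demonstration:

```python
# Illustrative sketch of a symmetric InfoNCE loss over a batch of matched
# image-text embedding pairs (hypothetical helper, not from the cited works).
import torch
import torch.nn.functional as F

def info_nce(image_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of matched pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds positive pairs,
    # all off-diagonal entries act as in-batch negatives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions (image-to-text and text-to-image), then average.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```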

COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval

H Lu, N Fei, Y Huo, Y Gao, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Large-scale single-stream pre-training has shown impressive performance in image-text
retrieval. Regrettably, it suffers from low inference efficiency due to heavy attention layers …

ERNIE-ViL 2.0: Multi-view contrastive learning for image-text pre-training

B Shan, W Yin, Y Sun, H Tian, H Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted
extensive attention from academia and industry due to their superior performance on various …

12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …