T-MARS: Improving visual representations by circumventing text feature learning

P Maini, S Goyal, ZC Lipton, JZ Kolter… - arXiv preprint arXiv …, 2023 - arxiv.org
Large web-sourced multimodal datasets have powered a slew of new methods for learning
general-purpose visual representations, advancing the state of the art in computer vision …

LAION-5B: An open large-scale dataset for training next generation image-text models

C Schuhmann, R Beaumont, R Vencu… - Advances in …, 2022 - proceedings.neurips.cc
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …

MAP: Multimodal uncertainty-aware vision-language pre-training model

Y Ji, J Wang, Y Gong, L Zhang, Y Zhu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal semantic understanding often has to deal with uncertainty, which means the
obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our …

ERNIE-ViL 2.0: Multi-view contrastive learning for image-text pre-training

B Shan, W Yin, Y Sun, H Tian, H Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent Vision-Language Pre-trained (VLP) models based on dual encoder have attracted
extensive attention from academia and industry due to their superior performance on various …

HiCLIP: Contrastive language-image pretraining with hierarchy-aware attention

S Geng, J Yuan, Y Tian, Y Chen, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The success of large-scale contrastive vision-language pretraining (CLIP) has benefited
both visual recognition and multimodal content understanding. The concise design brings …

Exploring visual interpretability for contrastive language-image pre-training

Y Li, H Wang, Y Duan, H Xu, X Li - arXiv preprint arXiv:2209.07046, 2022 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily
available supervision of natural language. It improves the performance of downstream vision …
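For context on the objective that several of the CLIP-style papers listed here build on, the sketch below shows a generic symmetric image-text contrastive (InfoNCE) loss. It is an illustrative, minimal version under the assumption of a dual-encoder setup with matched image-text pairs in a batch; the function name, arguments, and temperature value are hypothetical and not taken from any of the cited works.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective.
# Illustrative only; names and defaults are assumptions, not from the cited papers.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the two encoders,
    where row i of each tensor comes from the same image-text pair."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```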

Seeing what you miss: Vision-language pre-training with semantic completion learning

Y Ji, R Tu, J Jiang, W Kong, C Cai… - Proceedings of the …, 2023 - openaccess.thecvf.com
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn
the correct corresponding information across different modalities. For this purpose, inspired …

Demystifying CLIP data

H Xu, S Xie, XE Tan, PY Huang, R Howes… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced
research and applications in computer vision, fueling modern recognition systems and …

Cross-lingual cross-modal pretraining for multimodal retrieval

H Fei, T Yu, P Li - Proceedings of the 2021 Conference of the …, 2021 - aclanthology.org
Recent pretrained vision-language models have achieved impressive performance on cross-
modal retrieval tasks in English. Their success, however, heavily depends on the availability …

EVA-CLIP-18B: Scaling CLIP to 18 billion parameters

Q Sun, J Wang, Q Yu, Y Cui, F Zhang, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both
vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful …