A fistful of words: Learning transferable visual models from bag-of-words supervision

M Yuksekgonul, F Bianchi, P Kalluri, D Jurafsky… - arXiv preprint arXiv …, 2022 - arxiv.org

Despite the success of large vision and language models (VLMs) in many downstream
applications, it is unclear how well they encode compositional information. Here, we create …

Cited by 317 Related articles All 2 versions

[PDF] thecvf.com

Aligning bag of regions for open-vocabulary object detection

S Wu, W Zhang, S Jin, W Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually contains a bag …

Cited by 108 Related articles All 5 versions

[PDF] thecvf.com

CleanCLIP: Mitigating data poisoning attacks in multimodal contrastive learning

H Bansal, N Singhi, Y Yang, F Yin… - Proceedings of the …, 2023 - openaccess.thecvf.com

Multimodal contrastive pretraining has been used to train multimodal representation models,
such as CLIP, on large amounts of paired image-text data. However, previous studies have …

Cited by 47 Related articles All 6 versions

[PDF] springer.com

Few-shot adaptation of multi-modal foundation models: A survey

F Liu, T Zhang, W Dai, C Zhang, W Cai, X Zhou… - Artificial Intelligence …, 2024 - Springer

Multi-modal (vision-language) models, such as CLIP, are replacing traditional
supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of …

Cited by 18 Related articles All 4 versions

[PDF] arxiv.org

Detecting and understanding harmful memes: A survey

S Sharma, F Alam, MS Akhtar, D Dimitrov… - arXiv preprint arXiv …, 2022 - arxiv.org

The automatic identification of harmful content online is of major concern for social media
platforms, policymakers, and society. Researchers have studied textual, visual, and audio …

Cited by 61 Related articles All 6 versions

[PDF] neurips.cc

CyCLIP: Cyclic contrastive language-image pretraining

S Goel, H Bansal, S Bhatia, R Rossi… - Advances in …, 2022 - proceedings.neurips.cc

Recent advances in contrastive representation learning over paired image-text data have
led to models such as CLIP that achieve state-of-the-art performance for zero-shot …

Cited by 152 Related articles All 9 versions

[PDF] arxiv.org

Revisiting the role of language priors in vision-language models

Z Lin, X Chen, D Pathak, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org

Vision-language models (VLMs) are impactful in part because they can be applied to a
variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We …

Cited by 15 Related articles All 5 versions

[PDF] thecvf.com

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Y Zeng, Y Huang, J Zhang, Z Jie… - Proceedings of the …, 2024 - openaccess.thecvf.com

Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks which have been widely used for visual grounding tasks in a weakly …

Cited by 4 Related articles

[PDF] thecvf.com

FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models

A Bulat, Y Ouali… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Despite noise and caption quality having been acknowledged as important factors impacting
vision-language contrastive pre-training, in this paper we show that the full potential of …

Cited by 2 Related articles All 4 versions

[PDF] arxiv.org

IDEA: Increasing text diversity via online multi-label recognition for vision-language pre-training

X Huang, Y Zhang, Y Cheng, W Tian, R Zhao… - Proceedings of the 30th …, 2022 - dl.acm.org

Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated
superior performance in various fields. However, the image-text pairs co-occurrent on the …

Cited by 14 Related articles All 3 versions
