CLIPPING: Distilling CLIP-based models with a student base for video-language retrieval

R Pei, J Liu, W Li, B Shao, S Xu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Pre-training a vision-language model and then fine-tuning it on downstream tasks have
become a popular paradigm. However, pre-trained vision-language models with the …

Multimodal large language models: A survey

J Wu, W Gan, Z Chen, S Wan… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
The exploration of multimodal language models integrates multiple data types, such as
images, text, audio, and other heterogeneous data. While the latest large language …

BridgeTower: Building bridges between encoders in vision-language representation learning

X Xu, C Wu, S Rosenman, V Lal, W Che… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-
language representation learning in recent years. Current VL models either use lightweight …

FashionViL: Fashion-focused vision-and-language representation learning

X Han, L Yu, X Zhu, L Zhang, YZ Song… - European conference on …, 2022 - Springer
Abstract Large-scale Vision-and-Language (V+L) pre-training for representation learning
has proven to be effective in boosting various downstream V+L tasks. However, when it …

MixReorg: Cross-modal mixed patch reorganization is a good mask learner for open-world semantic segmentation

K Cai, P Ren, Y Zhu, H Xu, J Liu, C Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently, semantic segmentation models trained with image-level text supervision have
shown promising results in challenging open-world scenarios. However, these models still …

mCLIP: Multilingual CLIP via cross-lingual transfer

G Chen, L Hou, Y Chen, W Dai, L Shang… - Proceedings of the …, 2023 - aclanthology.org
Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable
performance on various downstream cross-modal tasks. However, they are usually biased …

ViewCo: Discovering text-supervised segmentation masks via multi-view semantic consistency

P Ren, C Li, H Xu, Y Zhu, G Wang, J Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, great success has been made in learning visual representations from text
supervision, facilitating the emergence of text-supervised semantic segmentation. However …

EfficientVLM: Fast and accurate vision-language models via knowledge distillation and modal-adaptive pruning

T Wang, W Zhou, Y Zeng, X Zhang - arXiv preprint arXiv:2210.07795, 2022 - arxiv.org
Pre-trained vision-language models (VLMs) have achieved impressive results in a range of
vision-language tasks. However, popular VLMs usually consist of hundreds of millions of …

Module-wise adaptive distillation for multimodality foundation models

C Liang, J Yu, MH Yang, M Brown… - Advances in …, 2024 - proceedings.neurips.cc
Pre-trained multimodal foundation models have demonstrated remarkable generalizability
but pose challenges for deployment due to their large sizes. One effective approach to …

Efficient vision-language pretraining with visual concepts and hierarchical alignment

M Shukor, G Couairon, M Cord - arXiv preprint arXiv:2208.13628, 2022 - arxiv.org
Vision and Language Pretraining has become the prevalent approach for tackling
multimodal downstream tasks. The current trend is to move towards ever larger models and …