CLIPPING: Distilling CLIP-based models with a student base for video-language retrieval

R Pei, J Liu, W Li, B Shao, S Xu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Pre-training a vision-language model and then fine-tuning it on downstream tasks have
become a popular paradigm. However, pre-trained vision-language models with the …

Multimodal large language models: A survey

J Wu, W Gan, Z Chen, S Wan… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
The exploration of multimodal language models integrates multiple data types, such as
images, text, audio, and other heterogeneous data. While the latest large language …

BridgeTower: Building bridges between encoders in vision-language representation learning

X Xu, C Wu, S Rosenman, V Lal, W Che… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-
language representation learning in recent years. Current VL models either use lightweight …

FashionViL: Fashion-focused vision-and-language representation learning

X Han, L Yu, X Zhu, L Zhang, YZ Song… - European conference on …, 2022 - Springer
Abstract Large-scale Vision-and-Language (V+L) pre-training for representation learning
has proven to be effective in boosting various downstream V+L tasks. However, when it …

MixReorg: Cross-modal mixed patch reorganization is a good mask learner for open-world semantic segmentation

K Cai, P Ren, Y Zhu, H Xu, J Liu, C Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently, semantic segmentation models trained with image-level text supervision have
shown promising results in challenging open-world scenarios. However, these models still …

mCLIP: Multilingual CLIP via cross-lingual transfer

G Chen, L Hou, Y Chen, W Dai, L Shang… - Proceedings of the …, 2023 - aclanthology.org
Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable
performance on various downstream cross-modal tasks. However, they are usually biased …

ViewCo: Discovering text-supervised segmentation masks via multi-view semantic consistency

P Ren, C Li, H Xu, Y Zhu, G Wang, J Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, great success has been made in learning visual representations from text
supervision, facilitating the emergence of text-supervised semantic segmentation. However …

EfficientVLM: Fast and accurate vision-language models via knowledge distillation and modal-adaptive pruning

T Wang, W Zhou, Y Zeng, X Zhang - arXiv preprint arXiv:2210.07795, 2022 - arxiv.org
Pre-trained vision-language models (VLMs) have achieved impressive results in a range of
vision-language tasks. However, popular VLMs usually consist of hundreds of millions of …

Module-wise adaptive distillation for multimodality foundation models

C Liang, J Yu, MH Yang, M Brown… - Advances in …, 2024 - proceedings.neurips.cc
Pre-trained multimodal foundation models have demonstrated remarkable generalizability
but pose challenges for deployment due to their large sizes. One effective approach to …

Efficient vision-language pretraining with visual concepts and hierarchical alignment

M Shukor, G Couairon, M Cord - arXiv preprint arXiv:2208.13628, 2022 - arxiv.org
Vision and Language Pretraining has become the prevalent approach for tackling
multimodal downstream tasks. The current trend is to move towards ever larger models and …