Distilled dual-encoder model for vision-language understanding

Z Wang, W Wang, H Zhu, M Liu, B Qin, F Wei - arXiv preprint arXiv …, 2021 - arxiv.org
We propose a cross-modal attention distillation framework to train a dual-encoder model for
vision-language understanding tasks, such as visual reasoning and visual question …

LXMERT: Learning cross-modality encoder representations from transformers

H Tan, M Bansal - arXiv preprint arXiv:1908.07490, 2019 - arxiv.org
Vision-and-language reasoning requires an understanding of visual concepts, language
semantics, and, most importantly, the alignment and relationships between these two …

12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …

MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning

J Chen, D Zhu, X Shen, X Li, Z Liu, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models have shown their remarkable capabilities as a general interface for
various language-related applications. Motivated by this, we aim to build a unified …

CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

UFO: A unified transformer for vision-language representation learning

J Wang, X Hu, Z Gan, Z Yang, X Dai, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of
processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …

Unsupervised prompt learning for vision-language models

T Huang, J Chu, F Wei - arXiv preprint arXiv:2204.03649, 2022 - arxiv.org
Contrastive vision-language models like CLIP have shown great progress in transfer
learning. In the inference stage, the proper text description, also known as a prompt, needs to …

BridgeTower: Building bridges between encoders in vision-language representation learning

X Xu, C Wu, S Rosenman, V Lal, W Che… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Vision-Language (VL) models with the Two-Tower architecture have dominated vision-language
representation learning in recent years. Current VL models either use lightweight …

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

Y Ji, J Wang, Y Gong, L Zhang, Y Zhu, H Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Multimodal semantic understanding often has to deal with uncertainty, which means the
obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our …

Learning without forgetting for vision-language models

DW Zhou, Y Zhang, J Ning, HJ Ye, DC Zhan… - arXiv preprint arXiv …, 2023 - arxiv.org
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real
world, which requires a learning system to adapt to new tasks without forgetting former ones …