Distilled dual-encoder model for vision-language understanding

Z Wang, W Wang, H Zhu, M Liu, B Qin, F Wei - arXiv preprint arXiv …, 2021 - arxiv.org
We propose a cross-modal attention distillation framework to train a dual-encoder model for
vision-language understanding tasks, such as visual reasoning and visual question …

LXMERT: Learning cross-modality encoder representations from transformers

H Tan, M Bansal - arXiv preprint arXiv:1908.07490, 2019 - arxiv.org
Vision-and-language reasoning requires an understanding of visual concepts, language
semantics, and, most importantly, the alignment and relationships between these two …

12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …

MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning

J Chen, D Zhu, X Shen, X Li, Z Liu, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models have shown their remarkable capabilities as a general interface for
various language-related applications. Motivated by this, we aim to build a unified …

CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

UFO: A unified transformer for vision-language representation learning

J Wang, X Hu, Z Gan, Z Yang, X Dai, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of
processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …

Unsupervised prompt learning for vision-language models

T Huang, J Chu, F Wei - arXiv preprint arXiv:2204.03649, 2022 - arxiv.org
Contrastive vision-language models like CLIP have shown great progress in transfer
learning. In the inference stage, the proper text description, also known as a prompt, needs to …

BridgeTower: Building bridges between encoders in vision-language representation learning

X Xu, C Wu, S Rosenman, V Lal, W Che… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Vision-Language (VL) models with the Two-Tower architecture have dominated vision-language
representation learning in recent years. Current VL models either use lightweight …

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

Y Ji, J Wang, Y Gong, L Zhang, Y Zhu, H Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Multimodal semantic understanding often has to deal with uncertainty, which means the
obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our …

Learning without forgetting for vision-language models

DW Zhou, Y Zhang, J Ning, HJ Ye, DC Zhan… - arXiv preprint arXiv …, 2023 - arxiv.org
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real
world, which requires a learning system to adapt to new tasks without forgetting former ones …