12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …

MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning

J Chen, D Zhu, X Shen, X Li, Z Liu, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models have shown remarkable capabilities as a general interface for
various language-related applications. Motivated by this, we aim to build a unified …

COOKIE: Contrastive cross-modal knowledge sharing pre-training for vision-language representation

K Wen, J Xia, Y Huang, L Li, J Xu… - Proceedings of the …, 2021 - openaccess.thecvf.com
There has been a recent surge of interest in cross-modal pre-training. However, existing
approaches pre-train a one-stream model to learn joint vision-language representation …

CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

LAVIS: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

Unifying vision-and-language tasks via text generation

J Cho, J Lei, H Tan, M Bansal - International Conference on …, 2021 - proceedings.mlr.press
Existing methods for vision-and-language learning typically require designing task-specific
architectures and objectives for each task. For example, a multi-label answer classifier for …

UFO: A unified transformer for vision-language representation learning

J Wang, X Hu, Z Gan, Z Yang, X Dai, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of
processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …

Multi-task learning of hierarchical vision-language representation

DK Nguyen, T Okatani - … of the IEEE/CVF Conference on …, 2019 - openaccess.thecvf.com
It is still challenging to build an AI system that can perform tasks involving vision and
language at a human level. So far, researchers have singled out individual tasks separately …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Video-LLaVA: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
Large Vision-Language Models (LVLMs) have enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …