H Tan, M Bansal - arXiv preprint arXiv:1908.07490, 2019 - arxiv.org
Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two …
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded …
Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified …
M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many …
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …
T Huang, J Chu, F Wei - arXiv preprint arXiv:2204.03649, 2022 - arxiv.org
Contrastive vision-language models like CLIP have shown great progress in transfer learning. In the inference stage, the proper text description, also known as a prompt, needs to …
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight …
Multimodal semantic understanding often has to deal with uncertainty, meaning the obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our …
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones …