12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …

MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning

J Chen, D Zhu, X Shen, X Li, Z Liu, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models have shown remarkable capabilities as a general interface for
various language-related applications. Motivated by this, we aim to build a unified …

COOKIE: Contrastive cross-modal knowledge sharing pre-training for vision-language representation

K Wen, J Xia, Y Huang, L Li, J Xu… - Proceedings of the …, 2021 - openaccess.thecvf.com
There has been a recent surge of interest in cross-modal pre-training. However, existing
approaches pre-train a one-stream model to learn joint vision-language representation …

CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

LAVIS: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

Unifying vision-and-language tasks via text generation

J Cho, J Lei, H Tan, M Bansal - International Conference on …, 2021 - proceedings.mlr.press
Existing methods for vision-and-language learning typically require designing task-specific
architectures and objectives for each task. For example, a multi-label answer classifier for …

UFO: A unified transformer for vision-language representation learning

J Wang, X Hu, Z Gan, Z Yang, X Dai, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of
processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …

Multi-task learning of hierarchical vision-language representation

DK Nguyen, T Okatani - … of the IEEE/CVF Conference on …, 2019 - openaccess.thecvf.com
It is still challenging to build an AI system that can perform tasks involving vision and
language at a human level. So far, researchers have singled out individual tasks separately …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Video-LLaVA: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
Large Vision-Language Models (LVLMs) have enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …