ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models

C Li, H Liu, L Li, P Zhang, J Aneja, et al. - Advances in Neural Information Processing Systems, 2022 - proceedings.neurips.cc
Learning visual representations from natural language supervision has recently shown great
promise in a number of pioneering works. In general, these language-augmented visual …

Visual spatial reasoning

F Liu, G Emerson, N Collier - Transactions of the Association for Computational Linguistics, 2023 - direct.mit.edu
Spatial relations are a basic part of human cognition. However, they are expressed in
natural language in a variety of ways, and previous work has suggested that current vision …

Modular deep learning

J Pfeiffer, S Ruder, I Vulić, EM Ponti - arXiv preprint arXiv:2302.11529, 2023 - arxiv.org
Transfer learning has recently become the dominant paradigm of machine learning. Pre-
trained models fine-tuned for downstream tasks achieve better performance with fewer …

X2-VLM: All-in-One Pre-trained Model for Vision-Language Tasks

Y Zeng, X Zhang, H Li, J Wang, et al. - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023 - ieeexplore.ieee.org
Vision language pre-training aims to learn alignments between vision and language from a
large amount of data. Most existing methods only learn image-text alignments. Some others …

mCLIP: Multilingual CLIP via cross-lingual transfer

G Chen, L Hou, Y Chen, W Dai, L Shang, et al. - Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023 - aclanthology.org
Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable
performance on various downstream cross-modal tasks. However, they are usually biased …

Large multilingual models pivot zero-shot multimodal learning across languages

J Hu, Y Yao, C Wang, S Wang, Y Pan, Q Chen, et al. - arXiv preprint, 2023 - arxiv.org
Recently there has been a significant surge in multimodal learning in terms of both image-to-
text and text-to-image generation. However, the success is typically limited to English …

Combining parameter-efficient modules for task-level generalisation

EM Ponti, A Sordoni, Y Bengio, et al. - Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023 - aclanthology.org
A modular design encourages neural models to disentangle and recombine different facets
of knowledge to generalise more systematically to new tasks. In this work, we assume that …

xGQA: Cross-lingual visual question answering

J Pfeiffer, G Geigle, A Kamath, JMO Steitz, et al. - arXiv preprint, 2021 - arxiv.org
Recent advances in multimodal vision and language modeling have predominantly focused
on the English language, mostly due to the lack of multilingual multimodal datasets to steer …

Unifying cross-lingual and cross-modal modeling towards weakly supervised multilingual vision-language pre-training

Z Li, Z Fan, J Chen, Q Zhang, XJ Huang, et al. - Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023 - aclanthology.org
Multilingual Vision-Language Pre-training (VLP) is a promising but challenging
topic due to the lack of large-scale multilingual image-text pairs. Existing works address the …

Combining modular skills in multitask learning

EM Ponti, A Sordoni, Y Bengio, S Reddy - arXiv preprint arXiv:2202.13914, 2022 - arxiv.org
A modular design encourages neural models to disentangle and recombine different facets
of knowledge to generalise more systematically to new tasks. In this work, we assume that …