ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models

C Li, H Liu, L Li, P Zhang, J Aneja, et al. - Advances in Neural Information Processing Systems, 2022 - proceedings.neurips.cc
Learning visual representations from natural language supervision has recently shown great
promise in a number of pioneering works. In general, these language-augmented visual …

Visual spatial reasoning

F Liu, G Emerson, N Collier - Transactions of the Association for Computational Linguistics, 2023 - direct.mit.edu
Spatial relations are a basic part of human cognition. However, they are expressed in
natural language in a variety of ways, and previous work has suggested that current vision …

Modular deep learning

J Pfeiffer, S Ruder, I Vulić, EM Ponti - arXiv preprint arXiv:2302.11529, 2023 - arxiv.org
Transfer learning has recently become the dominant paradigm of machine learning. Pre-
trained models fine-tuned for downstream tasks achieve better performance with fewer …

X2-VLM: All-in-One Pre-trained Model for Vision-Language Tasks

Y Zeng, X Zhang, H Li, J Wang, et al. - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023 - ieeexplore.ieee.org
Vision language pre-training aims to learn alignments between vision and language from a
large amount of data. Most existing methods only learn image-text alignments. Some others …

mCLIP: Multilingual CLIP via cross-lingual transfer

G Chen, L Hou, Y Chen, W Dai, L Shang, et al. - Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023 - aclanthology.org
Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable
performance on various downstream cross-modal tasks. However, they are usually biased …

Large multilingual models pivot zero-shot multimodal learning across languages

J Hu, Y Yao, C Wang, S Wang, Y Pan, Q Chen, et al. - arXiv preprint, 2023 - arxiv.org
Recently there has been a significant surge in multimodal learning in terms of both image-to-
text and text-to-image generation. However, the success is typically limited to English …

Combining parameter-efficient modules for task-level generalisation

EM Ponti, A Sordoni, Y Bengio, et al. - Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023 - aclanthology.org
A modular design encourages neural models to disentangle and recombine different facets
of knowledge to generalise more systematically to new tasks. In this work, we assume that …

xGQA: Cross-lingual visual question answering

J Pfeiffer, G Geigle, A Kamath, JMO Steitz, et al. - arXiv preprint, 2021 - arxiv.org
Recent advances in multimodal vision and language modeling have predominantly focused
on the English language, mostly due to the lack of multilingual multimodal datasets to steer …

Unifying cross-lingual and cross-modal modeling towards weakly supervised multilingual vision-language pre-training

Z Li, Z Fan, J Chen, Q Zhang, XJ Huang, et al. - Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023 - aclanthology.org
Multilingual Vision-Language Pre-training (VLP) is a promising but challenging
topic due to the lack of large-scale multilingual image-text pairs. Existing works address the …

Combining modular skills in multitask learning

EM Ponti, A Sordoni, Y Bengio, S Reddy - arXiv preprint arXiv:2202.13914, 2022 - arxiv.org
A modular design encourages neural models to disentangle and recombine different facets
of knowledge to generalise more systematically to new tasks. In this work, we assume that …