PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

On Scaling Up a Multilingual Vision and Language Model

X Chen, J Djolonga, P Padlewski… - Proceedings of the …, 2024 - openaccess.thecvf.com
We explore the boundaries of scaling up a multilingual vision and language model both in
terms of size of the components and the breadth of its training task mixture. Our model …

PaLI: A jointly-scaled multilingual language-image model

X Chen, X Wang, S Changpinyo… - arXiv preprint arXiv …, 2022 - arxiv.org
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …

Qwen-VL: A frontier large vision-language model with versatile abilities

J Bai, S Bai, S Yang, S Wang, S Tan, P Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …

12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …

Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond

J Bai, S Bai, S Yang, S Wang, S Tan, P Wang, J Lin… - 2023 - openreview.net
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2024 - proceedings.neurips.cc
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

Image as a foreign language: BEiT pretraining for all vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck, Z Peng, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state …

LAVIS: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …