Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either …
H Liu, K Son, J Yang, C Liu, J Gao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models are achieved via a web-scale …
M Cherti, R Beaumont, R Wightman… - Proceedings of the …, 2023 - openaccess.thecvf.com
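The snippet above refers to CLIP-style image-text contrastive pre-training. As a rough illustration (not taken from the cited paper), the core objective is a symmetric InfoNCE loss over a batch of paired image and text embeddings; the function name, temperature value, and NumPy formulation below are all illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss in the style of CLIP (illustrative sketch).

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    # Log-softmax over texts for each image, and over images for each text;
    # matched pairs sit on the diagonal
    log_sm_i = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    diag = np.arange(n)
    loss_i2t = -np.mean(log_sm_i[diag, diag])
    loss_t2i = -np.mean(log_sm_t[diag, diag])
    return (loss_i2t + loss_t2i) / 2
```

Correctly matched pairs yield a lower loss than shuffled pairs, which is what drives the alignment of the two modalities during training.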
Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set …
J Wang, P Zhou, MZ Shou… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we …
Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer …
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have …
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional …
J Bi, D Cheng, P Yao, B Pang, Y Zhan… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-Language Pretraining (VLP) has significantly improved the performance of various vision-language tasks with the matching of images and texts. In this paper, we …
M Alper, M Fiman… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only …