This monograph surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches …
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the …
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However …
The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense …
J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question …
Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse "human walk on/sit on/lay on beach" into "human …
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the …
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …
PW Koh, S Sagawa, H Marklund… - International …, 2021 - proceedings.mlr.press
Distribution shifts—where the training distribution differs from the test distribution—can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild …