Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the …
Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can these models generalize to arbitrary images for their …
Vision-language models (VLMs) excel in zero-shot recognition, but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive …
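A minimal sketch of the zero-shot recognition setting this snippet describes, assuming OpenAI's `clip` package; the class names and image path are illustrative stand-ins, not taken from the paper.

```python
# Zero-shot classification with CLIP: compare an image embedding against
# text embeddings of natural-language class prompts. Assumes `clip` is
# installed (pip install git+https://github.com/openai/CLIP).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["tench", "golden retriever", "fire engine"]  # hypothetical labels
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # assumed path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity between the image and each class prompt,
    # softmaxed into a distribution over the candidate classes.
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

The per-concept accuracy of this procedure depends on how well each concept was covered in pretraining, which is the variance the snippet points to.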
Images of the natural world, collected by a variety of cameras from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion …
Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models …
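For the retrieval setting this snippet lists, a short sketch of zero-shot text-to-image retrieval with CLIP embeddings; the gallery file names and the query are assumptions for illustration only.

```python
# Rank a gallery of images against a free-form text query by cosine
# similarity of CLIP embeddings. Assumes OpenAI's `clip` package.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

gallery_paths = ["beach.jpg", "forest.jpg", "city.jpg"]  # hypothetical gallery
images = torch.stack([preprocess(Image.open(p)) for p in gallery_paths]).to(device)
query = clip.tokenize(["a photo of a sandy beach"]).to(device)

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feat = model.encode_text(query)
    img_feats /= img_feats.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    scores = (txt_feat @ img_feats.T).squeeze(0)  # one similarity per image

# Gallery images ranked from best to worst match for the query.
for idx in scores.argsort(descending=True):
    print(gallery_paths[idx], float(scores[idx]))
```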
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, departing significantly from previous methods that rely on real data …
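A heavily hedged sketch of the general idea behind fully synthetic text-image pairs: a language model generates captions and a text-to-image model renders them, so no real data is crawled. This is a generic illustration, not SynthCLIP's actual pipeline; the model names and seed concepts are assumptions.

```python
# Generic synthetic-pair generation: caption LLM -> text-to-image model.
# Not the paper's method; models and prompts are illustrative stand-ins.
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

caption_gen = pipeline("text-generation", model="gpt2")  # stand-in caption LLM
image_gen = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

synthetic_pairs = []
for concept in ["a red bicycle", "a snowy mountain"]:  # hypothetical concepts
    # generated_text includes the prompt; good enough for a sketch caption.
    caption = caption_gen(f"Describe a photo of {concept}:",
                          max_new_tokens=30)[0]["generated_text"]
    image = image_gen(caption).images[0]  # PIL image rendered from the caption
    synthetic_pairs.append((image, caption))
# The resulting (image, caption) pairs would then feed a standard CLIP
# contrastive training loop like the one sketched below.
```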
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the …
The success of contrastive language-image pretraining (CLIP) relies on supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We …
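A sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style pretraining uses, which is where the noisy-pairing problem enters: matched image-caption pairs are pulled together while all other pairings in the batch act as negatives. Pure PyTorch; the feature tensors are random stand-ins for encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature  # (batch, batch)
    # The i-th image is assumed to match the i-th caption; a noisy
    # web-crawled pair makes this "correct" target unreliable, which is
    # exactly the problem the snippet raises.
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

batch_img = torch.randn(8, 512)  # stand-in image embeddings
batch_txt = torch.randn(8, 512)  # stand-in caption embeddings
print(clip_contrastive_loss(batch_img, batch_txt).item())
```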
Y Zhang, H Doughty… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for deep learning at scale. However, low-resource problems …