Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the …
Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can these models generalize to arbitrary images for their …
Vision-language models (VLMs) excel in zero-shot recognition, but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive …
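A minimal sketch of the zero-shot recognition setting this snippet describes, assuming OpenAI's `clip` package; the class names and image path are illustrative stand-ins, not taken from the paper.

```python
# Zero-shot classification with CLIP: compare an image embedding against
# text embeddings of natural-language class prompts. Assumes `clip` is
# installed (pip install git+https://github.com/openai/CLIP).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["tench", "golden retriever", "fire engine"]  # hypothetical labels
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # assumed path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity between the image and each class prompt,
    # softmaxed into a distribution over the candidate classes.
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

The per-concept accuracy of this procedure depends on how well each concept was covered in pretraining, which is the variance the snippet points to.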
Images of the natural world, collected by a variety of cameras from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion …
Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models …
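For the retrieval setting this snippet lists, a short sketch of zero-shot text-to-image retrieval with CLIP embeddings; the gallery file names and the query are assumptions for illustration only.

```python
# Rank a gallery of images against a free-form text query by cosine
# similarity of CLIP embeddings. Assumes OpenAI's `clip` package.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

gallery_paths = ["beach.jpg", "forest.jpg", "city.jpg"]  # hypothetical gallery
images = torch.stack([preprocess(Image.open(p)) for p in gallery_paths]).to(device)
query = clip.tokenize(["a photo of a sandy beach"]).to(device)

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feat = model.encode_text(query)
    img_feats /= img_feats.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    scores = (txt_feat @ img_feats.T).squeeze(0)  # one similarity per image

# Gallery images ranked from best to worst match for the query.
for idx in scores.argsort(descending=True):
    print(gallery_paths[idx], float(scores[idx]))
```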
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, departing significantly from previous methods that rely on real data …
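A heavily hedged sketch of the general idea behind fully synthetic text-image pairs: a language model generates captions and a text-to-image model renders them, so no real data is crawled. This is a generic illustration, not SynthCLIP's actual pipeline; the model names and seed concepts are assumptions.

```python
# Generic synthetic-pair generation: caption LLM -> text-to-image model.
# Not the paper's method; models and prompts are illustrative stand-ins.
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

caption_gen = pipeline("text-generation", model="gpt2")  # stand-in caption LLM
image_gen = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

synthetic_pairs = []
for concept in ["a red bicycle", "a snowy mountain"]:  # hypothetical concepts
    # generated_text includes the prompt; good enough for a sketch caption.
    caption = caption_gen(f"Describe a photo of {concept}:",
                          max_new_tokens=30)[0]["generated_text"]
    image = image_gen(caption).images[0]  # PIL image rendered from the caption
    synthetic_pairs.append((image, caption))
# The resulting (image, caption) pairs would then feed a standard CLIP
# contrastive training loop like the one sketched below.
```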
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the …
The success of contrastive language-image pretraining (CLIP) relies on supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We …
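A sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style pretraining uses, which is where the noisy-pairing problem enters: matched image-caption pairs are pulled together while all other pairings in the batch act as negatives. Pure PyTorch; the feature tensors are random stand-ins for encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature  # (batch, batch)
    # The i-th image is assumed to match the i-th caption; a noisy
    # web-crawled pair makes this "correct" target unreliable, which is
    # exactly the problem the snippet raises.
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

batch_img = torch.randn(8, 512)  # stand-in image embeddings
batch_txt = torch.randn(8, 512)  # stand-in caption embeddings
print(clip_contrastive_loss(batch_img, batch_txt).item())
```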
Y Zhang, H Doughty… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for deep learning at scale. However, low-resource problems …