Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs

S Tong, Z Liu, Y Zhai, Y Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
Is vision good enough for language? Recent advancements in multimodal models primarily
stem from the powerful reasoning abilities of large language models (LLMs). However, the …

Your diffusion model is secretly a zero-shot classifier

AC Li, M Prabhudesai, S Duggal… - Proceedings of the …, 2023 - openaccess.thecvf.com
The recent wave of large-scale text-to-image diffusion models has dramatically increased
our text-based image generation abilities. These models can generate realistic images for a …

SugarCrepe: Fixing hackable benchmarks for vision-language compositionality

CY Hsieh, J Zhang, Z Ma… - Advances in neural …, 2024 - proceedings.neurips.cc
In the last year alone, a surge of new benchmarks to measure compositional
understanding of vision-language models has permeated the machine learning ecosystem …

When and why vision-language models behave like bags-of-words, and what to do about it?

M Yuksekgonul, F Bianchi, P Kalluri… - The Eleventh …, 2023 - openreview.net
Despite the success of large vision and language models (VLMs) in many downstream
applications, it is unclear how well they encode the compositional relationships between …

Holistic evaluation of text-to-image models

T Lee, M Yasunaga, C Meng, Y Mai… - Advances in …, 2024 - proceedings.neurips.cc
The stunning qualitative improvement of text-to-image models has led to their widespread
attention and adoption. However, we lack a comprehensive quantitative understanding of …

Training-free structured diffusion guidance for compositional text-to-image synthesis

W Feng, X He, TJ Fu, V Jampani, A Akula… - arXiv preprint arXiv …, 2022 - arxiv.org
Large-scale diffusion models have achieved state-of-the-art results on text-to-image
synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we …

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

B Chen, Z Xu, S Kirmani, B Ichter… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding and reasoning about spatial relationships is crucial for Visual Question
Answering (VQA) and robotics. Vision Language Models (VLMs) have shown impressive …

CREPE: Can vision-language foundation models reason compositionally?

Z Ma, J Hong, MO Gul, M Gandhi… - Proceedings of the …, 2023 - openaccess.thecvf.com
A fundamental characteristic common to both human vision and natural language is their
compositional nature. Yet, despite the performance gains contributed by large vision and …