The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires …
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However the …
AC Li, M Prabhudesai, S Duggal… - Proceedings of the …, 2023 - openaccess.thecvf.com
The recent wave of large-scale text-to-image diffusion models has dramatically increased our text-based image generation abilities. These models can generate realistic images for a …
CY Hsieh, J Zhang, Z Ma… - Advances in neural …, 2024 - proceedings.neurips.cc
In the last year alone, a surge of new benchmarks to measure $\textit {compositional} $ understanding of vision-language models have permeated the machine learning ecosystem …
Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode the compositional relationships between …
The stunning qualitative improvement of text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of …
Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we …
Understanding and reasoning about spatial relationships is crucial for Visual Question Answering (VQA) and robotics. Vision Language Models (VLMs) have shown impressive …
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and …