Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Perception test: A diagnostic benchmark for multimodal video models

V Patraucean, L Smaira, A Gupta… - Advances in …, 2024 - proceedings.neurips.cc
We propose a novel multimodal video benchmark, the Perception Test, to evaluate the
perception and reasoning skills of pre-trained multimodal models (e.g., Flamingo, BEiT-3, or …

SwapMix: Diagnosing and regularizing the over-reliance on visual context in visual question answering

V Gupta, Z Li, A Kortylewski, C Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
While Visual Question Answering (VQA) has progressed rapidly, previous works have
raised concerns about the robustness of current VQA models. In this work, we study the …

VISAtlas: An image-based exploration and query system for large visualization collections via neural image embedding

Y Ye, R Huang, W Zeng - IEEE Transactions on Visualization …, 2022 - ieeexplore.ieee.org
High-quality visualization collections are beneficial for a variety of applications including
visualization reference and data-driven visualization design. The visualization community …

Benchmarking spatial relationships in text-to-image generation

T Gokhale, H Palangi, B Nushi, V Vineet… - arXiv preprint arXiv …, 2022 - arxiv.org
Spatial understanding is a fundamental aspect of computer vision and integral for human-
level reasoning about images, making it an important component for grounded language …

Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning

Z Li, X Wang, E Stengel-Eskin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual Question Answering (VQA) models often perform poorly on out-of-distribution
data and struggle with domain generalization. Due to the multi-modal nature of this task …

MUTANT: A training paradigm for out-of-distribution generalization in visual question answering

T Gokhale, P Banerjee, C Baral, Y Yang - arXiv preprint arXiv:2009.08566, 2020 - arxiv.org
While progress has been made on visual question answering leaderboards, models
often exploit spurious correlations and priors in datasets under the i.i.d. setting. As such …

Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering

C Dancette, R Cadene, D Teney… - Proceedings of the …, 2021 - openaccess.thecvf.com
We introduce an evaluation methodology for visual question answering (VQA) to better
diagnose cases of shortcut learning. These cases happen when a model exploits spurious …

Improving selective visual question answering by learning from your peers

C Dancette, S Whitehead… - Proceedings of the …, 2023 - openaccess.thecvf.com
Despite advances in Visual Question Answering (VQA), the ability of models to
assess their own correctness remains underexplored. Recent work has shown that VQA …

Negative Object Presence Evaluation (NOPE) to measure object hallucination in vision-language models

H Lovenia, W Dai, S Cahyawijaya, Z Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
Object hallucination poses a significant challenge in vision-language (VL) models, often
leading to the generation of nonsensical or unfaithful responses with non-existent objects …