Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Perception test: A diagnostic benchmark for multimodal video models

V Patraucean, L Smaira, A Gupta… - Advances in …, 2024 - proceedings.neurips.cc
We propose a novel multimodal video benchmark, the Perception Test, to evaluate the
perception and reasoning skills of pre-trained multimodal models (e.g., Flamingo, BEiT-3, or …

SwapMix: Diagnosing and regularizing the over-reliance on visual context in visual question answering

V Gupta, Z Li, A Kortylewski, C Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
While Visual Question Answering (VQA) has progressed rapidly, previous works have
raised concerns about the robustness of current VQA models. In this work, we study the …

VISAtlas: An image-based exploration and query system for large visualization collections via neural image embedding

Y Ye, R Huang, W Zeng - IEEE Transactions on Visualization …, 2022 - ieeexplore.ieee.org
High-quality visualization collections are beneficial for a variety of applications including
visualization reference and data-driven visualization design. The visualization community …

Benchmarking spatial relationships in text-to-image generation

T Gokhale, H Palangi, B Nushi, V Vineet… - arXiv preprint arXiv …, 2022 - arxiv.org
Spatial understanding is a fundamental aspect of computer vision and integral for human-
level reasoning about images, making it an important component for grounded language …

Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning

Z Li, X Wang, E Stengel-Eskin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual Question Answering (VQA) models often perform poorly on out-of-distribution
data and struggle with domain generalization. Due to the multi-modal nature of this task …

MUTANT: A training paradigm for out-of-distribution generalization in visual question answering

T Gokhale, P Banerjee, C Baral, Y Yang - arXiv preprint arXiv:2009.08566, 2020 - arxiv.org
While progress has been made on visual question answering leaderboards, models
often exploit spurious correlations and priors in datasets under the i.i.d. setting. As such …

Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering

C Dancette, R Cadene, D Teney… - Proceedings of the …, 2021 - openaccess.thecvf.com
We introduce an evaluation methodology for visual question answering (VQA) to better
diagnose cases of shortcut learning. These cases happen when a model exploits spurious …

Improving selective visual question answering by learning from your peers

C Dancette, S Whitehead… - Proceedings of the …, 2023 - openaccess.thecvf.com
Despite advances in Visual Question Answering (VQA), the ability of models to
assess their own correctness remains underexplored. Recent work has shown that VQA …

Negative Object Presence Evaluation (NOPE) to measure object hallucination in vision-language models

H Lovenia, W Dai, S Cahyawijaya, Z Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
Object hallucination poses a significant challenge in vision-language (VL) models, often
leading to the generation of nonsensical or unfaithful responses with non-existent objects …