ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models

P Xu, W Shao, K Zhang, P Gao, S Liu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite this great success, the field lacks a holistic evaluation …

Deep federated learning for autonomous driving

A Nguyen, T Do, M Tran, BX Nguyen… - 2022 IEEE Intelligent …, 2022 - ieeexplore.ieee.org
Autonomous driving is an active research topic in both academia and industry. However,
most existing solutions focus on improving accuracy by training learnable models …

PEVL: Position-enhanced pre-training and prompt tuning for vision-language models

Y Yao, Q Chen, A Zhang, W Ji, Z Liu, TS Chua… - arXiv preprint arXiv …, 2022 - arxiv.org
Vision-language pre-training (VLP) has shown impressive performance on a wide range of
cross-modal tasks, where VLP models without reliance on object detectors are becoming the …

Variational causal inference network for explanatory visual question answering

D Xue, S Qian, C Xu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal
reasoning task that requires answering visual questions and generating multimodal …

Scaling-up medical vision-and-language representation learning with federated learning

S Lu, Z Liu, T Liu, W Zhou - Engineering Applications of Artificial …, 2023 - Elsevier
Medical Vision-and-Language Pre-training (MedVLP), which learns generic vision-
language representations from medical images and texts to benefit various downstream …

Enhancing visual question answering through ranking-based hybrid training and multimodal fusion

P Chen, Z Zhang, Y Dong, L Zhou, H Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Visual Question Answering (VQA) is a challenging task that requires systems to provide
accurate answers to questions based on image content. Current VQA models struggle with …

Towards truly zero-shot compositional visual reasoning with LLMs as programmers

A Stanić, S Caelles, M Tschannen - arXiv preprint arXiv:2401.01974, 2024 - arxiv.org
Visual reasoning is dominated by end-to-end neural networks scaled to billions of model
parameters and training examples. However, even the largest models struggle with …

SelfGraphVQA: a self-supervised graph neural network for scene-based question answering

BC de Oliveira Souza, M Aasan… - Proceedings of the …, 2023 - openaccess.thecvf.com
The intersection of vision and language is of major interest due to the increased focus on
seamless integration between recognition and reasoning. Scene graphs (SGs) have …

Benchmarking out-of-distribution detection in visual question answering

X Shi, S Lee - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
When faced with an out-of-distribution (OOD) question or image, visual question answering
(VQA) systems may provide unreliable answers. If relied on by real users or secondary …