MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?

R Zhang, D Jiang, Y Zhang, H Lin, Z Guo, P Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered
unparalleled attention, due to their superior performance in visual contexts. However, their …

AGIEval: A human-centric benchmark for evaluating foundation models

W Zhong, R Cui, Y Guo, Y Liang, S Lu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the general abilities of foundation models to tackle human-level tasks is a vital
aspect of their development and application in the pursuit of Artificial General Intelligence …

Foundation models for decision making: Problems, methods, and opportunities

S Yang, O Nachum, Y Du, J Wei, P Abbeel… - arXiv preprint arXiv …, 2023 - arxiv.org
Foundation models pretrained on diverse data at scale have demonstrated extraordinary
capabilities in a wide range of vision and language tasks. When such models are deployed …

UniMath: A foundational and multimodal mathematical reasoner

Z Liang, T Yang, J Zhang, X Zhang - Proceedings of the 2023 …, 2023 - aclanthology.org
While significant progress has been made in natural language processing (NLP), existing
methods exhibit limitations in effectively interpreting and processing diverse mathematical …

GQA: A new dataset for real-world visual reasoning and compositional question answering

DA Hudson, CD Manning - … of the IEEE/CVF conference on …, 2019 - openaccess.thecvf.com
We introduce GQA, a new dataset for real-world visual reasoning and compositional
question answering, seeking to address key shortcomings of previous VQA datasets. We …

DePlot: One-shot visual language reasoning by plot-to-table translation

F Liu, JM Eisenschlos, F Piccinno, S Krichene… - arXiv preprint arXiv …, 2022 - arxiv.org
Visual language such as charts and plots is ubiquitous in the human world. Comprehending
plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models …

A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering

Y Li, L Wang, B Hu, X Chen, W Zhong, C Lyu… - arXiv preprint arXiv …, 2023 - arxiv.org
The emergence of multimodal large models (MLMs) has significantly advanced the field of
visual understanding, offering remarkable capabilities in the realm of visual question …

Generating natural language explanations for visual question answering using scene graphs and visual attention

S Ghosh, G Burachas, A Ray, A Ziskind - arXiv preprint arXiv:1902.05715, 2019 - arxiv.org
In this paper, we present a novel approach for the task of eXplainable Question Answering
(XQA), i.e., generating natural language (NL) explanations for the Visual Question Answering …