MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

GPT-4V(ision) is a generalist web agent, if grounded

B Zheng, B Gou, J Kil, H Sun, Y Su - arXiv preprint arXiv:2401.01614, 2024 - arxiv.org
The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and
Gemini, has been quickly expanding the capability boundaries of multimodal models …

Large multimodal models: Notes on CVPR 2023 tutorial

C Li - arXiv preprint arXiv:2306.14895, 2023 - arxiv.org
This tutorial note summarizes the presentation on "Large Multimodal Models: Towards
Building and Surpassing Multimodal GPT-4", a part of CVPR 2023 tutorial on "Recent …

Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models

A Miyai, J Yang, J Zhang, Y Ming, Q Yu, G Irie… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces a novel and significant challenge for Vision Language Models
(VLMs), termed Unsolvable Problem Detection (UPD). UPD examines the VLM's ability to …

Understanding and Improving In-Context Learning on Vision-language Models

S Chen, Z Han, B He, M Buckley, P Torr… - arXiv preprint arXiv …, 2023 - openreview.net
In-context learning (ICL) on large language models (LLMs) has received great attention, and
this technique can also be applied to vision-language models (VLMs) built upon LLMs …

On the Potential and Limitations of Few-Shot In-Context Learning to Generate Metamorphic Specifications for Tax Preparation Software

D Srinivas, R Das, S Tizpaz-Niari, A Trivedi… - arXiv preprint arXiv …, 2023 - arxiv.org
Due to the ever-increasing complexity of income tax laws in the United States, the number of
US taxpayers filing their taxes using tax preparation software (henceforth, tax software) …

Empowering Vision-Language Models for Reasoning Ability through Large Language Models

Y Yang, X Zhang, J Xu, W Han - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Vision-language models (VLMs) have shown excellent performance in vision-language tasks.
However, they sometimes lack sufficient reasoning ability. In contrast, large language …

Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection

Y Jiang, Y Wang - arXiv preprint arXiv:2407.12879, 2024 - arxiv.org
Large visual-language models (LVLMs) exhibit exceptional performance in visual-language
reasoning across diverse cross-modal benchmarks. Despite these advances, recent …

Unsolvable Problem Detection for Vision Language Models

A Miyai, J Yang, J Zhang, Y Ming, Q Yu, G Irie… - ICLR 2024 Workshop on … - openreview.net
This paper introduces a novel and significant challenge for Vision Language Models
(VLMs), termed Unsolvable Problem Detection (UPD). UPD examines the VLM's ability to …