MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

Gemini: a family of highly capable multimodal models

G Team, R Anil, S Borgeaud, Y Wu, JB Alayrac… - arXiv preprint arXiv …, 2023 - arxiv.org
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable
capabilities across image, audio, video, and text understanding. The Gemini family consists …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Hypergraph attention networks for multimodal learning

ES Kim, WY Kang, KW On, YJ Heo… - Proceedings of the …, 2020 - openaccess.thecvf.com
One of the fundamental problems that arise in multimodal learning tasks is the disparity of
information levels between different modalities. To resolve this problem, we propose …

i-Code: An integrative and composable multimodal learning framework

Z Yang, Y Fang, C Zhu, R Pryzant, D Chen… - Proceedings of the …, 2023 - ojs.aaai.org
Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to
maintain a holistic worldview. Most current pretraining methods, however, are limited to one …

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning

W Li, C Gao, G Niu, X Xiao, H Liu, J Liu, H Wu… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and
cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or …

Dynamic multimodal fusion

Z Xue, R Marculescu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Deep multimodal learning has achieved great progress in recent years. However, current
fusion approaches are static in nature, i.e., they process and fuse multimodal inputs with …

UIBert: Learning generic multimodal representations for UI understanding

C Bai, X Zang, Y Xu, S Sunkara, A Rastogi… - arXiv preprint arXiv …, 2021 - arxiv.org
To improve the accessibility of smart devices and to simplify their usage, building models
which understand user interfaces (UIs) and assist users to complete their tasks is critical …