Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Z Tang, Z Yang, M Khademi, Y Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract We present CoDi-2 a Multimodal Large Language Model (MLLM) for learning in-
context interleaved multimodal representations. By aligning modalities with language for …

Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

Clippo: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions which current multimodal systems largely struggle to imitate. In this work …

Unified-io: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …

Violin: A large-scale dataset for video-and-language inference

J Liu, W Chen, Y Cheng, Z Gan, L Yu… - Proceedings of the …, 2020 - openaccess.thecvf.com
We introduce a new task, Video-and-Language Inference, for joint multimodal
understanding of video and text. Given a video clip with aligned subtitles as premise, paired …

Merlot reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

Pandagpt: One model to instruction-follow them all

Y Su, T Lan, H Li, J Xu, Y Wang, D Cai - arXiv preprint arXiv:2305.16355, 2023 - arxiv.org
We present PandaGPT, an approach to emPower large lANguage moDels with visual and
Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can …