Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang, S Khosla… - arXiv e …, 2023 - ui.adsabs.harvard.edu
We present Unified-IO 2, the first autoregressive multimodal model that is capable of
understanding and generating image, text, audio, and action. To unify different modalities …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang, S Khosla… - arXiv preprint arXiv …, 2023 - arxiv.org
We present Unified-IO 2, the first autoregressive multimodal model that is capable of
understanding and generating image, text, audio, and action. To unify different modalities …