Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

MaMMUT: A simple architecture for joint learning for multimodal tasks

W Kuo, AJ Piergiovanni, D Kim, X Luo, B Caine… - arXiv preprint arXiv …, 2023 - arxiv.org
The development of language models has moved from encoder-decoder to decoder-only
designs. In addition, we observe that the two most popular multimodal tasks, the generative …

IntentQA: Context-aware video intent reasoning

J Li, P Wei, W Han, L Fan - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
In this paper, we propose IntentQA, a novel VideoQA task focusing on video
intent reasoning, which has become increasingly important for AI with its advantages in …

Learning to answer questions in dynamic audio-visual scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

Lion: Empowering multimodal large language model with dual-level visual knowledge

G Chen, L Shen, R Shao, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability
to perceive and understand multi-modal signals. However, most of the existing MLLMs …

Distilling vision-language models on millions of videos

Y Zhao, L Zhao, X Zhou, J Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models, but there …

Align and prompt: Video-and-language pre-training with entity prompts

D Li, J Li, H Li, JC Niebles… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video-and-language pre-training has shown promising improvements on various
downstream tasks. Most previous methods capture cross-modal interactions with a …

SMAUG: Sparse masked autoencoder for efficient video-language pre-training

Y Lin, C Wei, H Wang, A Yuille… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video-language pre-training is crucial for learning powerful multi-modal representations.
However, it typically requires a massive amount of computation. In this paper, we develop …

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …

VideoBERT: A joint model for video and language representation learning

C Sun, A Myers, C Vondrick… - Proceedings of the …, 2019 - openaccess.thecvf.com
Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …