Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

MaMMUT: A simple architecture for joint learning for multimodal tasks

W Kuo, AJ Piergiovanni, D Kim, X Luo, B Caine… - arXiv preprint arXiv …, 2023 - arxiv.org
The development of language models has moved from encoder-decoder to decoder-only
designs. In addition, we observe that the two most popular multimodal tasks, the generative …

IntentQA: Context-aware video intent reasoning

J Li, P Wei, W Han, L Fan - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
In this paper, we propose IntentQA, a novel VideoQA task focusing on video
intent reasoning, which has become increasingly important for AI with its advantages in …

Learning to answer questions in dynamic audio-visual scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

Lion: Empowering multimodal large language model with dual-level visual knowledge

G Chen, L Shen, R Shao, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability
to perceive and understand multi-modal signals. However, most of the existing MLLMs …

Distilling vision-language models on millions of videos

Y Zhao, L Zhao, X Zhou, J Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models, but there …

Align and prompt: Video-and-language pre-training with entity prompts

D Li, J Li, H Li, JC Niebles… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video-and-language pre-training has shown promising improvements on various
downstream tasks. Most previous methods capture cross-modal interactions with a …

SMAUG: Sparse masked autoencoder for efficient video-language pre-training

Y Lin, C Wei, H Wang, A Yuille… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video-language pre-training is crucial for learning powerful multi-modal representations.
However, it typically requires a massive amount of computation. In this paper, we develop …

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …

VideoBERT: A joint model for video and language representation learning

C Sun, A Myers, C Vondrick… - Proceedings of the …, 2019 - openaccess.thecvf.com
Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …