OmniVL: One foundation model for image-language and video-language tasks

J Wang, D Chen, Z Wu, C Luo, L Zhou… - Advances in neural …, 2022 - proceedings.neurips.cc
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

MERLOT: Multimodal neural script knowledge models

R Zellers, X Lu, J Hessel, Y Yu… - Advances in neural …, 2021 - proceedings.neurips.cc
As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …

End-to-end generative pretraining for multimodal video captioning

PH Seo, A Nagrani, A Arnab… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …

UniVL: A unified video and language pre-training model for multimodal understanding and generation

H Luo, L Ji, B Shi, H Huang, N Duan, T Li, J Li… - arXiv preprint arXiv …, 2020 - arxiv.org
With the recent success of pre-training techniques for NLP and image-language tasks,
video-language pre-training works have gradually been developed to improve video-text …

CrossCLR: Cross-modal contrastive learning for multi-modal video representations

M Zolfaghari, Y Zhu, P Gehler… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs
from sets of negative samples. Recently, the principle has also been used to learn cross …
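The contrastive principle this abstract refers to can be illustrated with a symmetric InfoNCE loss over paired video and text embeddings, where each video's own caption is the positive and the other captions in the batch act as negatives. This is a generic sketch of cross-modal contrastive learning, not CrossCLR's specific intra/inter-modality loss design:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (video, text) pairs.

    Row i of each matrix is assumed to be a matching pair; all other
    rows in the batch serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B); positives lie on the diagonal

    # video -> text direction: log-softmax over each row, take the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_v2t = -np.mean(np.diag(log_probs))

    # text -> video direction: same with the transposed similarity matrix
    log_probs_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2v = -np.mean(np.diag(log_probs_t))

    return (loss_v2t + loss_t2v) / 2
```

With correctly paired embeddings the diagonal similarities dominate and the loss is small; shuffling one modality's rows breaks the pairing and drives the loss up, which is exactly the signal the methods above train on.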

End-to-end learning of visual representations from uncurated instructional videos

A Miech, JB Alayrac, L Smaira… - Proceedings of the …, 2020 - openaccess.thecvf.com
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video
models still rely on manually annotated data. With the recent introduction of the HowTo100M …

Text with knowledge graph augmented transformer for video captioning

X Gu, G Chen, Y Wang, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video captioning aims to describe the content of videos using natural language. Although
significant progress has been made, there is still much room to improve the performance for …

Multi-modal dense video captioning

V Iashin, E Rahtu - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Dense video captioning is the task of localizing interesting events in an untrimmed video
and producing a textual description (caption) for each localized event. Most of the previous …

Multimodal categorization of crisis events in social media

M Abavisani, L Wu, S Hu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Recent developments in image classification and natural language processing, coupled with
the rapid growth in social media usage, have enabled fundamental advances in detecting …