As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce …
Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
H Luo, L Ji, B Shi, H Huang, N Duan, T Li, J Li… - arXiv preprint arXiv …, 2020 - arxiv.org
With the recent success of pre-training techniques for NLP and image-language tasks, video-language pre-training methods have gradually been developed to improve video-text …
Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs from sets of negative samples. Recently, the principle has also been used to learn cross …
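The principle this snippet describes — scoring a positive pair against a set of negatives — can be sketched as a minimal InfoNCE-style loss over a batch of paired cross-modal embeddings. This is an illustrative NumPy sketch, not the implementation from any of the cited papers; the function name and the in-batch-negatives setup are assumptions.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: for each video, its paired text (same row index)
    is the positive, and every other text in the batch is a negative."""
    # L2-normalise so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature               # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # softmax over each row; the positive pair sits on the diagonal
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(probs)))

# toy batch: when the paired embeddings match, the loss is near zero;
# shuffling the pairing breaks the diagonal and the loss grows
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
print(contrastive_loss(emb, emb))        # matched pairs: small loss
print(contrastive_loss(emb, emb[::-1]))  # mismatched pairs: large loss
```

The same row-wise softmax structure underlies most cross-modal contrastive objectives; only the encoders producing `video_emb` and `text_emb` differ.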
Annotating videos is cumbersome, expensive, and not scalable. Yet many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M …
Video captioning aims to describe the content of videos in natural language. Although significant progress has been made, there is still much room to improve performance for …
V Iashin, E Rahtu - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Dense video captioning is the task of localizing interesting events in an untrimmed video and producing a textual description (caption) for each localized event. Most of the previous …
M Abavisani, L Wu, S Hu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Recent developments in image classification and natural language processing, coupled with the rapid growth in social media usage, have enabled fundamental advances in detecting …