OmniVL: One foundation model for image-language and video-language tasks

J Wang, D Chen, Z Wu, C Luo, L Zhou… - Advances in neural …, 2022 - proceedings.neurips.cc
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

MERLOT: Multimodal neural script knowledge models

R Zellers, X Lu, J Hessel, Y Yu… - Advances in neural …, 2021 - proceedings.neurips.cc
As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …

End-to-end generative pretraining for multimodal video captioning

PH Seo, A Nagrani, A Arnab… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …

UniVL: A unified video and language pre-training model for multimodal understanding and generation

H Luo, L Ji, B Shi, H Huang, N Duan, T Li, J Li… - arXiv preprint arXiv …, 2020 - arxiv.org
With the recent success of pre-training techniques for NLP and image-language tasks,
video-language pre-training works have gradually been developed to improve video-text …

CrossCLR: Cross-modal contrastive learning for multi-modal video representations

M Zolfaghari, Y Zhu, P Gehler… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs
from sets of negative samples. Recently, the principle has also been used to learn cross …
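The contrastive principle this abstract refers to can be illustrated with a symmetric InfoNCE loss over paired video and text embeddings, where each video's own caption is the positive and the other captions in the batch act as negatives. This is a generic sketch of cross-modal contrastive learning, not CrossCLR's specific intra/inter-modality loss design:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (video, text) pairs.

    Row i of each matrix is assumed to be a matching pair; all other
    rows in the batch serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B); positives lie on the diagonal

    # video -> text direction: log-softmax over each row, take the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_v2t = -np.mean(np.diag(log_probs))

    # text -> video direction: same with the transposed similarity matrix
    log_probs_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2v = -np.mean(np.diag(log_probs_t))

    return (loss_v2t + loss_t2v) / 2
```

With correctly paired embeddings the diagonal similarities dominate and the loss is small; shuffling one modality's rows breaks the pairing and drives the loss up, which is exactly the signal the methods above train on.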

End-to-end learning of visual representations from uncurated instructional videos

A Miech, JB Alayrac, L Smaira… - Proceedings of the …, 2020 - openaccess.thecvf.com
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video
models still rely on manually annotated data. With the recent introduction of the HowTo100M …

Text with knowledge graph augmented transformer for video captioning

X Gu, G Chen, Y Wang, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video captioning aims to describe the content of videos using natural language. Although
significant progress has been made, there is still much room to improve the performance for …

Multi-modal dense video captioning

V Iashin, E Rahtu - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Dense video captioning is the task of localizing interesting events in an untrimmed video
and producing a textual description (caption) for each localized event. Most of the previous …

Multimodal categorization of crisis events in social media

M Abavisani, L Wu, S Hu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Recent developments in image classification and natural language processing, coupled with
the rapid growth in social media usage, have enabled fundamental advances in detecting …