On the effectiveness of task granularity for transfer learning

F Mahdisoltani, G Berger, W Gharbieh, D Fleet… - arXiv preprint arXiv …, 2018 - arxiv.org
We describe a DNN for video classification and captioning, trained end-to-end, with shared
features, to solve tasks at different levels of granularity, exploring the link between …

Fine-grained video classification and captioning

F Mahdisoltani, G Berger, W Gharbieh… - arXiv preprint arXiv …, 2018 - researchgate.net
We describe a DNN for fine-grained action classification and video captioning. It gives state-
of-the-art performance on the challenging Something-Something dataset, with over 220,000 …

VideoBERT: A joint model for video and language representation learning

C Sun, A Myers, C Vondrick… - Proceedings of the …, 2019 - openaccess.thecvf.com
Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …

Boosting video representation learning with multi-faceted integration

Z Qiu, T Yao, CW Ngo, XP Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Video content is multifaceted, consisting of objects, scenes, interactions or actions. The
existing datasets mostly label only one of the facets for model training, resulting in the video …

Video understanding as machine translation

B Korbar, F Petroni, R Girdhar, L Torresani - arXiv preprint arXiv …, 2020 - arxiv.org
With the advent of large-scale multimodal video datasets, especially sequences with audio
or transcribed speech, there has been a growing interest in self-supervised learning of video …

InternVideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art
performance in action recognition, video-text tasks, and video-centric dialogue. Our …

To create what you tell: Generating videos from captions

Y Pan, Z Qiu, T Yao, H Li, T Mei - Proceedings of the 25th ACM …, 2017 - dl.acm.org
We create multimedia content every day and everywhere. While automatic content
generation has posed a fundamental challenge to the multimedia community for decades …

Token mixing: parameter-efficient transfer learning from image-language to video-language

Y Liu, L Xu, P Xiong, Q Jin - Proceedings of the AAAI Conference on …, 2023 - ojs.aaai.org
Applying large-scale pre-trained image-language models to video-language tasks has
recently become a trend, which brings two challenges. One is how to effectively transfer …

Class prototypes based contrastive learning for classifying multi-label and fine-grained educational videos

R Gupta, A Roy, C Christensen, S Kim… - Proceedings of the …, 2023 - openaccess.thecvf.com
The recent growth in the consumption of online media by children during early childhood
necessitates data-driven tools enabling educators to filter out appropriate educational …

Learning grounded vision-language representation for versatile understanding in untrimmed videos

T Wang, J Zhang, F Zheng, W Jiang, R Cheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Joint video-language learning has received increasing attention in recent years. However,
existing works mainly focus on single or multiple trimmed video clips (events), which makes …