Related articles - Academic Resource Search

On the effectiveness of task granularity for transfer learning

F Mahdisoltani, G Berger, W Gharbieh, D Fleet… - arXiv preprint arXiv …, 2018 - arxiv.org

We describe a DNN for video classification and captioning, trained end-to-end, with shared
features, to solve tasks at different levels of granularity, exploring the link between …

Cited by 51 - Related articles - All 3 versions

[PDF] researchgate.net

[PDF] Fine-grained video classification and captioning

F Mahdisoltani, G Berger, W Gharbieh… - arXiv preprint arXiv …, 2018 - researchgate.net

We describe a DNN for fine-grained action classification and video captioning. It gives
state-of-the-art performance on the challenging Something-Something dataset, with over 220,000 …

Cited by 46 - Related articles

[PDF] thecvf.com

Videobert: A joint model for video and language representation learning

C Sun, A Myers, C Vondrick… - Proceedings of the …, 2019 - openaccess.thecvf.com

Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …

Cited by 1417 - Related articles - All 10 versions

[PDF] thecvf.com

Boosting video representation learning with multi-faceted integration

Z Qiu, T Yao, CW Ngo, XP Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com

Video content is multifaceted, consisting of objects, scenes, interactions or actions. The
existing datasets mostly label only one of the facets for model training, resulting in the video …

Cited by 13 - Related articles - All 8 versions

Video understanding as machine translation

B Korbar, F Petroni, R Girdhar, L Torresani - arXiv preprint arXiv …, 2020 - arxiv.org

With the advent of large-scale multimodal video datasets, especially sequences with audio
or transcribed speech, there has been a growing interest in self-supervised learning of video …

Cited by 29 - Related articles - All 2 versions

[PDF] arxiv.org

Internvideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce InternVideo2, a new video foundation model (ViFM) that achieves the
state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 42 - Related articles - All 3 versions

[PDF] arxiv.org

To create what you tell: Generating videos from captions

Y Pan, Z Qiu, T Yao, H Li, T Mei - Proceedings of the 25th ACM …, 2017 - dl.acm.org

We are creating multimedia contents everyday and everywhere. While automatic content
generation has played a fundamental challenge to multimedia community for decades …

Cited by 149 - Related articles - All 4 versions

[PDF] aaai.org

Token mixing: parameter-efficient transfer learning from image-language to video-language

Y Liu, L Xu, P Xiong, Q Jin - Proceedings of the AAAI Conference on …, 2023 - ojs.aaai.org

Applying large scale pre-trained image-language model to video-language tasks has
recently become a trend, which brings two challenges. One is how to effectively transfer …

Cited by 4 - Related articles - All 3 versions

[PDF] thecvf.com

Class prototypes based contrastive learning for classifying multi-label and fine-grained educational videos

R Gupta, A Roy, C Christensen, S Kim… - Proceedings of the …, 2023 - openaccess.thecvf.com

The recent growth in the consumption of online media by children during early childhood
necessitates data-driven tools enabling educators to filter out appropriate educational …

Cited by 10 - Related articles - All 6 versions

[PDF] arxiv.org

Learning grounded vision-language representation for versatile understanding in untrimmed videos

T Wang, J Zhang, F Zheng, W Jiang, R Cheng… - arXiv preprint arXiv …, 2023 - arxiv.org

Joint video-language learning has received increasing attention in recent years. However,
existing works mainly focus on single or multiple trimmed video clips (events), which makes …

Cited by 6 - Related articles - All 2 versions
