Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Human activity recognition in artificial intelligence framework: a narrative review

N Gupta, SK Gupta, RK Pathak, V Jain… - Artificial intelligence …, 2022 - Springer
Human activity recognition (HAR) has multifaceted applications due to the widespread use of
acquisition devices such as smartphones and video cameras, and its ability to capture human …

DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

VideoMAE V2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scale is the primary factor for building a powerful foundation model that can generalize
well to a variety of downstream tasks. However, it is still challenging to train video …

AdaptFormer: Adapting vision transformers for scalable visual recognition

S Chen, C Ge, Z Tong, J Wang… - Advances in …, 2022 - proceedings.neurips.cc
Pretraining Vision Transformers (ViTs) has achieved great success in visual
recognition. A natural next step is to adapt a ViT to various image and video recognition …

Masked autoencoders as spatiotemporal learners

C Feichtenhofer, Y Li, K He - Advances in neural …, 2022 - proceedings.neurips.cc
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to
spatiotemporal representation learning from videos. We randomly mask out spacetime …

MineDojo: Building open-ended embodied agents with internet-scale knowledge

L Fan, G Wang, Y Jiang, A Mandlekar… - Advances in …, 2022 - proceedings.neurips.cc
Autonomous agents have made great strides in specialist domains like Atari games and Go.
However, they typically learn tabula rasa in isolated environments with limited and manually …

VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc
Pre-training video transformers on extra large-scale datasets is generally required to
achieve top performance on relatively small datasets. In this paper, we show that video …

Scaling up and distilling down: Language-guided robot skill acquisition

H Ha, P Florence, S Song - Conference on Robot Learning, 2023 - proceedings.mlr.press
We present a framework for robot skill acquisition, which 1) efficiently scales up data
generation of language-labelled robot data and 2) effectively distills this data down into a …

SEED-Bench: Benchmarking multimodal LLMs with generative comprehension

B Li, R Wang, G Wang, Y Ge, Y Ge, Y Shan - arXiv preprint arXiv …, 2023 - arxiv.org
Based on powerful Large Language Models (LLMs), recent generative Multimodal Large
Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting …