Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Human activity recognition in artificial intelligence framework: a narrative review

N Gupta, SK Gupta, RK Pathak, V Jain… - Artificial intelligence …, 2022 - Springer
Human activity recognition (HAR) has multifaceted applications due to the widespread use of
acquisition devices such as smartphones and video cameras, and its ability to capture human …

DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

VideoMAE V2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scale is the primary factor for building a powerful foundation model that can generalize
well to a variety of downstream tasks. However, it is still challenging to train video …

AdaptFormer: Adapting vision transformers for scalable visual recognition

S Chen, C Ge, Z Tong, J Wang… - Advances in …, 2022 - proceedings.neurips.cc
Pretraining Vision Transformers (ViTs) has achieved great success in visual
recognition. A natural next step is to adapt a ViT to various image and video recognition …

Masked autoencoders as spatiotemporal learners

C Feichtenhofer, Y Li, K He - Advances in neural …, 2022 - proceedings.neurips.cc
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to
spatiotemporal representation learning from videos. We randomly mask out spacetime …

MineDojo: Building open-ended embodied agents with internet-scale knowledge

L Fan, G Wang, Y Jiang, A Mandlekar… - Advances in …, 2022 - proceedings.neurips.cc
Autonomous agents have made great strides in specialist domains like Atari games and Go.
However, they typically learn tabula rasa in isolated environments with limited and manually …

VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc
Pre-training video transformers on extra large-scale datasets is generally required to
achieve top performance on relatively small datasets. In this paper, we show that video …

Scaling up and distilling down: Language-guided robot skill acquisition

H Ha, P Florence, S Song - Conference on Robot Learning, 2023 - proceedings.mlr.press
We present a framework for robot skill acquisition, which 1) efficiently scales up data
generation of language-labelled robot data and 2) effectively distills this data down into a …

SEED-Bench: Benchmarking multimodal LLMs with generative comprehension

B Li, R Wang, G Wang, Y Ge, Y Ge, Y Shan - arXiv preprint arXiv …, 2023 - arxiv.org
Based on powerful Large Language Models (LLMs), recent generative Multimodal Large
Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting …