Ava: A video dataset of spatio-temporally localized atomic visual actions

N Gupta, SK Gupta, RK Pathak, V Jain… - Artificial intelligence …, 2022 - Springer

Human activity recognition (HAR) has multifaceted applications due to its worldly usage of
acquisition devices such as smartphones, video cameras, and its ability to capture human …

被引用次数：182 相关文章所有 10 个版本

[PDF] arxiv.org

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org

The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

被引用次数：96 相关文章所有 4 个版本

[PDF] neurips.cc

Masked autoencoders as spatiotemporal learners

C Feichtenhofer, Y Li, K He - Advances in neural …, 2022 - proceedings.neurips.cc

This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to
spatiotemporal representation learning from videos. We randomly mask out spacetime …

被引用次数：378 相关文章所有 4 个版本

[PDF] thecvf.com

Diffusiondet: Diffusion model for object detection

S Chen, P Sun, Y Song, P Luo - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

We propose DiffusionDet, a new framework that formulates object detection as a denoising
diffusion process from noisy boxes to object boxes. During the training stage, object boxes …

被引用次数：288 相关文章所有 5 个版本

[PDF] thecvf.com

Videomae v2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com

Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

被引用次数：178 相关文章所有 7 个版本

[PDF] neurips.cc

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc

Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …

被引用次数：696 相关文章所有 6 个版本

[PDF] arxiv.org

Motiondiffuse: Text-driven human motion generation with diffusion model

M Zhang, Z Cai, L Pan, F Hong, X Guo… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org

Human motion modeling is important for many modern graphics applications, which typically
require professional skills. In order to remove the skill barriers for laymen, recent motion …

被引用次数：306 相关文章所有 6 个版本

[PDF] thecvf.com

Masked feature prediction for self-supervised visual pre-training

C Wei, H Fan, S Xie, CY Wu, A Yuille… - Proceedings of the …, 2022 - openaccess.thecvf.com

Abstract We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training
of video models. Our approach first randomly masks out a portion of the input sequence and …

被引用次数：551 相关文章所有 7 个版本

[PDF] thecvf.com

Mvitv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

被引用次数：581 相关文章所有 8 个版本

[PDF] thecvf.com

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

被引用次数：628 相关文章所有 16 个版本

高级搜索

QQ 群