Mvitv2: Improved multiscale vision transformers for classification and detection

AA Aleissaee, A Kumar, RM Anwer, S Khan… - Remote Sensing, 2023 - mdpi.com

Deep learning-based algorithms have seen a massive popularity in different areas of remote
sensing image analysis over the past decade. Recently, transformer-based architectures …

被引用次数：118 相关文章所有 7 个版本

[HTML] sciencedirect.com

[HTML][HTML] Deep learning based computer vision approaches for smart agricultural applications

VG Dhanya, A Subeesh, NL Kushwaha… - Artificial Intelligence in …, 2022 - Elsevier

The agriculture industry is undergoing a rapid digital transformation and is growing powerful
by the pillars of cutting-edge approaches like artificial intelligence and allied technologies …

被引用次数：119 相关文章所有 5 个版本

[PDF] thecvf.com

Image as a foreign language: Beit pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023 - openaccess.thecvf.com

A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

被引用次数：365 相关文章所有 5 个版本

[PDF] thecvf.com

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com

We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

被引用次数：441 相关文章所有 5 个版本

[PDF] thecvf.com

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

被引用次数：270 相关文章所有 6 个版本

[PDF] thecvf.com

Videomae v2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com

Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

被引用次数：200 相关文章所有 7 个版本

[PDF] neurips.cc

Masked autoencoders as spatiotemporal learners

C Feichtenhofer, Y Li, K He - Advances in neural …, 2022 - proceedings.neurips.cc

This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to
spatiotemporal representation learning from videos. We randomly mask out spacetime …

被引用次数：402 相关文章所有 5 个版本

[PDF] arxiv.org

Vision transformer adapter for dense predictions

Z Chen, Y Duan, W Wang, J He, T Lu, J Dai… - arXiv preprint arXiv …, 2022 - arxiv.org

This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike
recent visual transformers that introduce vision-specific inductive biases into their …

被引用次数：424 相关文章所有 3 个版本

[PDF] neurips.cc

Vitpose: Simple vision transformer baselines for human pose estimation

Y Xu, J Zhang, Q Zhang, D Tao - Advances in Neural …, 2022 - proceedings.neurips.cc

Although no specific domain knowledge is considered in the design, plain vision
transformers have shown excellent performance in visual recognition tasks. However, little …

被引用次数：423 相关文章所有 5 个版本

[PDF] neurips.cc

Hornet: Efficient high-order spatial interactions with recursive gated convolutions

Y Rao, W Zhao, Y Tang, J Zhou… - Advances in Neural …, 2022 - proceedings.neurips.cc

Recent progress in vision Transformers exhibits great success in various tasks driven by the
new spatial modeling mechanism based on dot-product self-attention. In this paper, we …

被引用次数：236 相关文章所有 5 个版本

高级搜索

QQ 群