An empirical study of training self-supervised vision transformers

C Zhou, Q Li, C Li, J Yu, Y Liu, G Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks with different data modalities. A PFM (eg, BERT, ChatGPT, and GPT-4) is …

被引用次数：413 相关文章所有 2 个版本

[PDF] springer.com

Attention mechanisms in computer vision: A survey

MH Guo, TX Xu, JJ Liu, ZN Liu, PT Jiang, TJ Mu… - Computational visual …, 2022 - Springer

Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …

被引用次数：1392 相关文章所有 8 个版本

[PDF] arxiv.org

Dinov2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org

The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

被引用次数：1032 相关文章所有 11 个版本

[PDF] thecvf.com

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com

We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

被引用次数：452 相关文章所有 5 个版本

[PDF] thecvf.com

Convnext v2: Co-designing and scaling convnets with masked autoencoders

S Woo, S Debnath, R Hu, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com

Driven by improved architectures and better representation learning frameworks, the field of
visual recognition has enjoyed rapid modernization and performance boost in the early …

被引用次数：358 相关文章所有 8 个版本

[PDF] thecvf.com

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

被引用次数：273 相关文章所有 6 个版本

[PDF] ieee.org

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

被引用次数：383 相关文章所有 9 个版本

[PDF] neurips.cc

Adaptformer: Adapting vision transformers for scalable visual recognition

S Chen, C Ge, Z Tong, J Wang… - Advances in …, 2022 - proceedings.neurips.cc

Abstract Pretraining Vision Transformers (ViTs) has achieved great success in visual
recognition. A following scenario is to adapt a ViT to various image and video recognition …

被引用次数：357 相关文章所有 7 个版本

[PDF] arxiv.org

Glm-130b: An open bilingual pre-trained model

A Zeng, X Liu, Z Du, Z Wang, H Lai, M Ding… - arXiv preprint arXiv …, 2022 - arxiv.org

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model
with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as …

被引用次数：363 相关文章所有 5 个版本

[PDF] nature.com

A foundation model for generalizable disease detection from retinal images

Y Zhou, MA Chia, SK Wagner, MS Ayhan… - Nature, 2023 - nature.com

Medical artificial intelligence (AI) offers great potential for recognizing signs of health
conditions in retinal images and expediting the diagnosis of eye diseases and systemic …

被引用次数：146 相关文章所有 24 个版本

高级搜索

QQ 群