Authors
Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao
Publication date
2023/11/3
Journal
IEEE Transactions on Pattern Analysis and Machine Intelligence
Publisher
IEEE
Description
In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. ViTPose employs a plain, non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled to 1B parameters by taking advantage of the scalable model capacity and high parallelism, setting a new Pareto front for throughput and performance. In addition, ViTPose is very flexible with regard to attention type, input resolution, and pre-training and fine-tuning strategy. Building on this flexibility, a novel ViTPose++ model is proposed to deal with heterogeneous body keypoint …