Authors
Chongzhi Zhang#, Mingyuan Zhang#, Shanghang Zhang#, Daisheng Jin, Qiang Zhou, Zhongang Cai, Haiyu Zhao, Shuai Yi, Xianglong Liu, and Ziwei Liu
Publication date
2022
Conference
Conference on Computer Vision and Pattern Recognition (CVPR)
Description
Recently, Vision Transformers have achieved impressive results on various vision tasks. Yet, their generalization ability under different distribution shifts is poorly understood. In this work, we provide a comprehensive study on the out-of-distribution generalization of Vision Transformers. To support a systematic investigation, we first present a taxonomy of distribution shifts by categorizing them into five conceptual levels: corruption shift, background shift, texture shift, destruction shift, and style shift. Then we perform extensive evaluations of Vision Transformer variants under different levels of distribution shifts and compare their generalization ability with Convolutional Neural Network (CNN) models. Several important observations are obtained: 1) Vision Transformers generalize better than CNNs under multiple distribution shifts. With the same or fewer parameters, Vision Transformers are ahead of corresponding CNNs by more than 5% in top-1 accuracy under most types of distribution shift. In particular, Vision Transformers lead by more than 10% under the corruption shifts. 2) Larger Vision Transformers gradually narrow the in-distribution (ID) and out-of-distribution (OOD) performance gap. To further improve the generalization of Vision Transformers, we design enhanced Vision Transformers through self-supervised learning, information theory, and adversarial learning. By investigating these three types of generalization-enhanced Transformers, we observe the gradient-sensitivity of Vision Transformers and design a smoother learning strategy to achieve a stable training process. With modified training schemes, we achieve …
Scholar articles
C Zhang, M Zhang, S Zhang, D Jin, Q Zhou, Z Cai… - Proceedings of the IEEE/CVF conference on Computer …, 2022