Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization

J Jiang, W Huang, M Zhang, T Suzuki, L Nie - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have demonstrated great power in the recent development of large
foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary …

Implicit Regularization of Gradient Flow on One-Layer Softmax Attention

H Sheen, S Chen, T Wang, HH Zhou - arXiv preprint arXiv:2403.08699, 2024 - arxiv.org
We study gradient flow on the exponential loss for a classification problem with a one-layer
softmax attention model, where the key and query weight matrices are trained separately …