H Sheen, S Chen, T Wang, HH Zhou - arXiv preprint arXiv:2403.08699, 2024 - arxiv.org
We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model, where the key and query weight matrices are trained separately …