The lazy neuron phenomenon: On emergence of activation sparsity in transformers

Z Li, C You, S Bhojanapalli, D Li, AS Rawat… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper studies the curious phenomenon that the activation maps of machine learning
models with Transformer architectures are sparse. By activation map we refer to the …

Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing

J Alman, Z Song, R Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Over the last decade, deep neural networks have transformed our society, and they are
already widely applied in various machine learning applications. State-of-the-art deep …

Training multi-layer over-parametrized neural network in subquadratic time

Z Song, L Zhang, R Zhang - arXiv preprint arXiv:2112.07628, 2021 - arxiv.org
We consider the problem of training a multi-layer over-parametrized neural network to
minimize the empirical risk induced by a loss function. In the typical setting of over …

Efficient asynchronize stochastic gradient algorithm with structured data

Z Song, M Ye - arXiv preprint arXiv:2305.08001, 2023 - arxiv.org
Deep learning has achieved impressive success in a variety of fields because of its good
generalization. However, it has been a challenging problem to quickly train a neural network …