generalization behavior we observe in neural networks. In this work, we demonstrate that
non-stochastic full-batch training can achieve performance comparable to that of SGD on
CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of
SGD can be completely replaced with explicit regularization even when comparing against a
strong and well-researched baseline. Our observations indicate that the perceived difficulty …