models can be pruned to at least 50% sparsity in one shot, without any retraining, at minimal
loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically
designed to work efficiently and accurately on massive GPT-family models. We can execute
SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in
under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in …