Recent years have seen a phenomenal rise in the performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional …
E Frantar, D Alistarh - International Conference on Machine …, 2023 - proceedings.mlr.press
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal …
As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to …
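To make concrete what one-shot pruning to 50% sparsity means, here is a minimal sketch that uses plain weight magnitude as the pruning criterion. The random weight matrix and the helper name are made up for illustration; methods such as SparseGPT and Wanda score weights with calibration data rather than magnitude alone, which is what lets them reach high sparsity without retraining.

import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero.
    One-shot: the mask is computed once and no retraining follows."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # Threshold = magnitude of the k-th smallest |weight|.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)

W = np.random.randn(256, 256)                  # stand-in for one layer's weights
W_sparse = magnitude_prune(W, sparsity=0.5)
print(f"sparsity: {np.mean(W_sparse == 0):.2%}")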
E Frantar, D Alistarh - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model …
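The one-shot/post-training setting is typically formalized per layer: given the trained weights and a small batch of calibration inputs, choose compressed weights whose outputs stay close to the original layer's. A minimal sketch of that reconstruction error, with made-up shapes and random data standing in for real calibration inputs:

import numpy as np

def layerwise_error(W: np.ndarray, W_hat: np.ndarray, X: np.ndarray) -> float:
    """||W X - W_hat X||_F^2: deviation of the compressed layer's outputs
    from the original layer's outputs on calibration inputs X."""
    return float(np.sum((W @ X - W_hat @ X) ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))                   # original trained layer
X = rng.standard_normal((128, 32))                   # 32 "calibration" inputs
W_hat = W * (np.abs(W) > np.median(np.abs(W)))       # crude ~50% magnitude pruning
print(layerwise_error(W, W_hat, X))

Methods in this line of work differ mainly in how they search for compressed weights that keep this error small under a sparsity or quantization constraint.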
Network compression has been widely studied because it reduces the memory and computation cost of inference. However, previous methods seldom deal with …
This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One of the major causes of the high generation latency is the sequential decoding …
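The sequential-decoding bottleneck comes from autoregression: each new token requires a full forward pass that cannot start until the previous token is known, so generating n tokens costs n dependent model calls. A minimal sketch of that loop follows; the toy scoring function is a stand-in for an LLM forward pass, not any real model API.

import numpy as np

def toy_model(tokens: list[int], vocab_size: int = 100) -> np.ndarray:
    """Stand-in for an LLM forward pass: returns scores for the next token.
    A real model would run a full transformer over the whole prefix."""
    rng = np.random.default_rng(tokens[-1])
    return rng.standard_normal(vocab_size)

def greedy_decode(prompt: list[int], max_new_tokens: int = 20) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)               # one forward pass per new token
        tokens.append(int(np.argmax(scores)))    # next step depends on this result
    return tokens

print(greedy_decode([1, 2, 3]))

Because each step consumes the previous step's output, the passes cannot be parallelized along the generated sequence; that dependency is what latency-reduction work of this kind tries to relax.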
Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and …
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date, only ad-hoc comparisons between the two have been …
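For concreteness, the two families being compared act on the same weights in different ways: pruning zeroes a subset of them, while quantization keeps all of them at lower precision. A minimal side-by-side sketch using the simplest representative of each (plain magnitude pruning and symmetric uniform int4 rounding; the random tensor is made up for illustration):

import numpy as np

def prune_half(W: np.ndarray) -> np.ndarray:
    """Pruning: zero out the ~50% smallest-magnitude weights."""
    return W * (np.abs(W) > np.median(np.abs(W)))

def quantize_int4(W: np.ndarray) -> np.ndarray:
    """Quantization: round every weight to a symmetric 4-bit uniform grid."""
    scale = np.max(np.abs(W)) / 7                # map the largest |weight| to level 7
    return np.clip(np.round(W / scale), -7, 7) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
print("pruning error:     ", np.linalg.norm(W - prune_half(W)))
print("quantization error:", np.linalg.norm(W - quantize_int4(W)))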