Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

SparseGPT: Massive language models can be accurately pruned in one-shot

E Frantar, D Alistarh - International Conference on Machine …, 2023 - proceedings.mlr.press
We show for the first time that large-scale generative pretrained transformer (GPT) family
models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal …
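
SparseGPT reaches this result with a layer-wise, Hessian-based reconstruction that adjusts the surviving weights after each removal; the snippet only states the headline. For orientation, here is a minimal sketch of the naive baseline it improves on, one-shot magnitude pruning of a single weight matrix to 50% sparsity (a NumPy toy, not the paper's algorithm):

import numpy as np

def magnitude_prune(W: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    # One-shot unstructured pruning: zero the smallest-magnitude weights.
    # SparseGPT additionally updates the kept weights to compensate.
    k = int(W.size * sparsity)                         # weights to drop
    threshold = np.partition(np.abs(W).ravel(), k)[k]  # k-th smallest magnitude
    return W * (np.abs(W) >= threshold)

W = np.random.randn(512, 512).astype(np.float32)
print(f"sparsity: {(magnitude_prune(W) == 0).mean():.2%}")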

A simple and effective pruning approach for large language models

M Sun, Z Liu, A Bair, JZ Kolter - arXiv preprint arXiv:2306.11695, 2023 - arxiv.org
As their size increases, Large Language Models (LLMs) are natural candidates for network
pruning methods: approaches that drop a subset of network weights while striving to …
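
The "simple and effective" method here (Wanda) scores each weight by its magnitude times the norm of the matching input activation, so pruning needs neither retraining nor second-order statistics. A sketch of that score for one linear layer, assuming calibration activations X of shape (n_samples, in_features); treat the details as an approximation of the paper:

import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    # W: (out_features, in_features); X: (n_samples, in_features)
    score = np.abs(W) * np.linalg.norm(X, axis=0)    # |weight| * input-feature norm
    k = int(W.shape[1] * sparsity)                   # weights to drop per output row
    drop = np.argpartition(score, k, axis=1)[:, :k]  # k lowest scores in each row
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, drop, 0.0, axis=1)
    return W_pruned

W = np.random.randn(256, 512).astype(np.float32)
X = np.random.randn(128, 512).astype(np.float32)
print(f"sparsity: {(wanda_prune(W, X) == 0).mean():.2%}")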

Optimal brain compression: A framework for accurate post-training quantization and pruning

E Frantar, D Alistarh - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We consider the problem of model compression for deep neural networks (DNNs) in the
challenging one-shot/post-training setting, in which we are given an accurate trained model …
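
The framework descends from Optimal Brain Surgeon: repeatedly remove the weight whose deletion raises a layer-wise quadratic reconstruction loss the least, then update the remaining weights to compensate, both computed from the inverse Hessian (H = 2 X^T X plus damping, for a linear layer on calibration data X). A toy single-step sketch, ignoring the efficiency tricks that are the paper's actual contribution:

import numpy as np

def obs_step(w: np.ndarray, H_inv: np.ndarray):
    # One Optimal-Brain-Surgeon step on a weight row w of length d.
    diag = np.diag(H_inv)
    saliency = w**2 / (2.0 * diag)          # loss increase if weight q is zeroed
    q = int(np.argmin(saliency))            # cheapest weight to remove
    w = w - (w[q] / diag[q]) * H_inv[:, q]  # compensating update of the rest
    w[q] = 0.0                              # exact zero despite float noise
    return w, q

X = np.random.randn(64, 16)                 # toy calibration inputs
H_inv = np.linalg.inv(2.0 * X.T @ X + 1e-3 * np.eye(16))
w, q = obs_step(np.random.randn(16), H_inv)
print(f"removed weight {q}; nonzeros left: {np.count_nonzero(w)}")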

Group fisher pruning for practical network compression

L Liu, S Zhang, Z Kuang, A Zhou… - International …, 2021 - proceedings.mlr.press
Network compression has been widely studied since it is able to reduce the memory and
computation cost during inference. However, previous methods seldom deal with …
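
Group Fisher pruning ranks coupled channel groups by an estimate, built from first-order gradients, of how much the loss would grow if the group were removed. Below is a rough single-layer sketch of a diagonal-Fisher importance score in PyTorch; the paper's exact formulation works on channel masks and handles cross-layer coupling, which this toy ignores:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
x, y = torch.randn(32, 3, 16, 16), torch.randint(0, 10, (32,))
F.cross_entropy(model(x), y).backward()

# Fisher-style importance: squared-gradient mass per output channel of the conv
g = model[0].weight.grad                   # shape (out_ch, in_ch, kH, kW)
importance = (g ** 2).sum(dim=(1, 2, 3))
print("prune first:", int(importance.argmin()))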

Skeleton-of-thought: Large language models can do parallel decoding

X Ning, Z Lin, Z Zhou, Z Wang, H Yang… - Proceedings ENLSP …, 2023 - lirias.kuleuven.be
This work aims at decreasing the end-to-end generation latency of large language models
(LLMs). One of the major causes of the high generation latency is the sequential decoding …
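
The idea: instead of decoding one long answer strictly token by token, first ask the model for a short skeleton (a numbered list of points), then expand all points concurrently, so most of the generation happens in parallel. A minimal sketch of that control flow, with llm() as a hypothetical stand-in for a blocking generation call:

from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    # stand-in for a real (blocking) LLM API call -- hypothetical
    return f"[text for: {prompt[:40]}...]"

def skeleton_of_thought(question: str) -> str:
    # Stage 1: short sequential call that produces only an outline
    skeleton = llm(f"Give a concise numbered outline answering: {question}")
    points = [p for p in skeleton.splitlines() if p.strip()]
    # Stage 2: expand every point concurrently (the latency win)
    with ThreadPoolExecutor(max_workers=max(1, len(points))) as pool:
        bodies = pool.map(lambda p: llm(f"{question}\nExpand: {p}"), points)
    return "\n\n".join(bodies)

print(skeleton_of_thought("Why is sequential decoding slow?"))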

The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models

E Kurtic, D Campos, T Nguyen, E Frantar… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformer-based language models have become a key building block for natural
language processing. While these models are extremely accurate, they can be too large and …
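
This method builds on the same Optimal-Brain-Surgeon quantities as the optimal brain compression entry above; its contribution is making them tractable at BERT scale via block-wise inverse-Hessian approximations. For reference, the classical saliency of weight w_q and the compensating update, stated in LaTeX (standard OBS formulas, not copied from the paper):

\rho_q = \frac{w_q^2}{2\,[\mathbf{H}^{-1}]_{qq}},
\qquad
\delta\mathbf{w} = -\,\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\,\mathbf{H}^{-1}\mathbf{e}_q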

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Pruning vs quantization: which is better?

A Kuzmin, M Nagel, M Van Baalen… - Advances in neural …, 2024 - proceedings.neurips.cc
Neural network pruning and quantization techniques are almost as old as neural networks
themselves. However, to date, only ad-hoc comparisons between the two have been …
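
The question can be previewed numerically: at a roughly comparable budget, which distorts the weights less, zeroing half of them or rounding all of them to 4 bits? A toy comparison on Gaussian weights (illustrative only; the paper's analysis covers real networks, matched compression rates, and task accuracy):

import numpy as np

W = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)

# 50% magnitude pruning: zero the smaller half of the weights
t = np.median(np.abs(W))
W_pruned = np.where(np.abs(W) >= t, W, 0.0)

# 4-bit uniform quantization: 16 levels across the observed range
lo, hi = W.min(), W.max()
step = (hi - lo) / 15
W_quant = np.round((W - lo) / step) * step + lo

print(f"pruning MSE: {np.mean((W - W_pruned) ** 2):.4f}")
print(f"4-bit   MSE: {np.mean((W - W_quant) ** 2):.4f}")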