A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations

H Cheng, M Zhang, JQ Shi - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
Modern deep neural networks, particularly recent large language models, come with
massive model sizes that require significant computational and storage resources. To …

A survey on model compression for large language models

X Zhu, J Li, Y Liu, C Ma, W Wang - Transactions of the Association for Computational Linguistics, 2024
Large Language Models (LLMs) have successfully transformed natural language processing
tasks. Yet, their large size and high computational needs pose challenges for …

Compact language models via pruning and knowledge distillation

S Muralidharan, ST Sreenivas, RB Joshi… - The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
Large language models (LLMs) targeting different deployment scales and sizes are currently
produced by training each variant from scratch; this is extremely compute-intensive. In this …
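
This entry pairs structured pruning with knowledge distillation from the original model to recover accuracy in the smaller one. Purely as an illustration of the general idea, and not the paper's exact objective, the sketch below shows a standard temperature-scaled logit-distillation loss; the temperature value and tensor shapes are arbitrary choices for the example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: random logits for a batch of 4 examples over a 32k-token vocabulary.
student = torch.randn(4, 32000)
teacher = torch.randn(4, 32000)
loss = distillation_loss(student, teacher)
```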

BK-SDM: A lightweight, fast, and cheap version of Stable Diffusion

BK Kim, HK Song, T Castells, S Choi - European Conference on Computer Vision, 2025
Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves high
computing demands due to billion-scale parameters. To enhance efficiency, recent studies …

A Review on Edge Large Language Models: Design, Execution, and Applications

Y Zheng, Y Chen, B Qian, X Shi, Y Shu… - arXiv preprint, 2024
Large language models (LLMs) have revolutionized natural language processing with their
exceptional capabilities. However, deploying LLMs on resource-constrained edge devices …

Transformer layers as painters

Q Sun, M Pickett, AK Nain, L Jones - arXiv preprint arXiv:2407.09298, 2024
Despite their nearly universal adoption for large language models, the internal workings of
transformers are not well understood. We aim to better understand the impact of removing or …

HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models

RS Sukthanker, A Zela, B Staffler, A Klein… - arXiv preprint, 2024
The increasing size of language models necessitates a thorough analysis across multiple
dimensions to assess trade-offs among crucial hardware metrics such as latency, energy …

Mixture-of-modules: Reinventing transformers as dynamic assemblies of modules

Z Gong, A Lv, J Guan, J Yan, W Wu, H Zhang… - arXiv preprint, 2024
Is it always necessary to compute tokens from shallow to deep layers in Transformers? The
continued success of vanilla Transformers and their variants suggests an undoubted "yes" …

LAPTOP-Diff: Layer pruning and normalized distillation for compressing diffusion models

D Zhang, S Li, C Chen, Q Xie, H Lu - arXiv preprint arXiv:2404.11098, 2024
In the era of AIGC, the demand for low-budget or even on-device applications of diffusion
models has emerged. For compressing Stable Diffusion models (SDMs), several …

A deeper look at depth pruning of LLMs

SA Siddiqui, X Dong, G Heinrich, T Breuel… - arXiv preprint, 2024
Large Language Models (LLMs) are not only resource-intensive to train but even more
costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs …
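
Depth (block) pruning removes whole transformer blocks rather than individual weights. As a rough sketch only, not the method of this or any of the papers above, the code below deletes a contiguous range of decoder blocks from a Hugging Face LLaMA-style model; the attribute path model.model.layers, the checkpoint name, and the pruned index range [20, 26) are assumptions that vary by architecture and by how block importance is scored.

```python
# Illustrative depth (block) pruning: delete a contiguous range of transformer blocks.
# Assumes a LLaMA-style Hugging Face model whose decoder blocks live in
# `model.model.layers` (an nn.ModuleList); other architectures use different paths.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

def drop_blocks(model: nn.Module, start: int, end: int) -> nn.Module:
    """Remove blocks with indices in [start, end) and reindex the survivors."""
    kept = [blk for i, blk in enumerate(model.model.layers) if not (start <= i < end)]
    for new_idx, blk in enumerate(kept):
        # Attention modules may cache their layer index for the KV cache; keep it consistent.
        if hasattr(blk.self_attn, "layer_idx"):
            blk.self_attn.layer_idx = new_idx
    model.model.layers = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    return model

# Hypothetical usage: prune six of the deeper blocks, then fine-tune to recover quality.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             torch_dtype=torch.float16)
model = drop_blocks(model, start=20, end=26)
```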