PowerInfer: Fast large language model serving with a consumer-grade GPU

Y Song, Z Mi, H Xie, H Chen - Proceedings of the ACM SIGOPS 30th …, 2024 - dl.acm.org
This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference
engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Museformer: Transformer with fine- and coarse-grained attention for music generation

B Yu, P Lu, R Wang, W Hu, X Tan… - Advances in …, 2022 - proceedings.neurips.cc
Symbolic music generation aims to generate music scores automatically. A recent trend is to
use Transformer or its variants in music generation, which is, however, suboptimal, because …

SparseTIR: Composable abstractions for sparse compilation in deep learning

Z Ye, R Lai, J Shao, T Chen, L Ceze - Proceedings of the 28th ACM …, 2023 - dl.acm.org
Sparse tensors are rapidly becoming critical components of modern deep learning
workloads. However, developing high-performance sparse operators can be difficult and …

Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

L Wang, L Ma, S Cao, Q Zhang, J Xue, Y Shi… - … USENIX Symposium on …, 2024 - usenix.org
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …

Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

H Xia, Z Zheng, Y Li, D Zhuang, Z Zhou, X Qiu… - arXiv preprint arXiv …, 2023 - arxiv.org
With the fast growth of parameter size, it becomes increasingly challenging to deploy large
generative models as they typically require large GPU memory consumption and massive …

Optimizing dynamic neural networks with brainstorm

W Cui, Z Han, L Ouyang, Y Wang, N Zheng… - … USENIX Symposium on …, 2023 - usenix.org
Dynamic neural networks (NNs), which can adapt sparsely activated sub-networks to inputs
during inference, have shown significant advantages over static ones in terms of accuracy …

PIT: Optimization of dynamic sparse deep learning models via permutation invariant transformation

N Zheng, H Jiang, Q Zhang, Z Han, L Ma… - Proceedings of the 29th …, 2023 - dl.acm.org
Dynamic sparsity, where the sparsity patterns are unknown until runtime, poses a significant
challenge to deep learning. The state-of-the-art sparsity-aware deep learning solutions are …

Register Tiling for Unstructured Sparsity in Neural Network Inference

L Wilkinson, K Cheshmi, MM Dehnavi - Proceedings of the ACM on …, 2023 - dl.acm.org
Unstructured sparse neural networks are an important class of machine learning (ML)
models, as they compact model size and reduce floating point operations. The execution …

Looplets: A language for structured coiteration

W Ahrens, D Donenfeld, F Kjolstad… - Proceedings of the 21st …, 2023 - dl.acm.org
Real world arrays often contain underlying structure, such as sparsity, runs of repeated
values, or symmetry. Specializing for structure yields significant speedups. But automatically …