H2O: Heavy-hitter oracle for efficient generative inference of large language models

Z Zhang, Y Sheng, T Zhou, T Chen… - Advances in …, 2023 - proceedings.neurips.cc
Large Language Models (LLMs), despite their recent impressive accomplishments,
are notably cost-prohibitive to deploy, particularly for applications involving long-content …
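
A minimal sketch of the heavy-hitter idea named in the title, under assumed shapes and names (the scoring rule, window sizes, and function below are illustrative, not the paper's algorithm): keep the cached key/value entries that have accumulated the most attention, plus a recent window, and evict the rest.

    import numpy as np

    def evict_kv_cache(attn_weights, keep_heavy=4, keep_recent=4):
        # attn_weights: (num_queries, num_cached_tokens) rows of softmax attention weights.
        scores = attn_weights.sum(axis=0)                 # cumulative attention per cached token
        n = scores.shape[0]
        recent = set(range(max(0, n - keep_recent), n))   # always keep a recent window
        heavy = set(int(i) for i in np.argsort(scores)[::-1][:keep_heavy])  # top cumulative scores
        return sorted(recent | heavy)                     # indices of KV entries to keep

    # Toy usage: 8 decoding steps attending over 12 cached tokens.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(8, 12))
    weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    print(evict_kv_cache(weights))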

Deja vu: Contextual sparsity for efficient LLMs at inference time

Z Liu, J Wang, T Dao, T Zhou, B Yuan… - International …, 2023 - proceedings.mlr.press
Large language models (LLMs) with hundreds of billions of parameters have sparked a new
wave of exciting AI applications. However, they are computationally expensive at inference …
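
As a rough illustration of contextual sparsity (assumed setup; the paper trains cheap predictors rather than computing exact pre-activations as done here): for each input, select only the feed-forward neurons predicted to matter and skip the rest of the computation.

    import numpy as np

    def contextually_sparse_ffn(x, W1, W2, top_k):
        pre = W1 @ x                             # in practice a small learned predictor scores neurons cheaply
        idx = np.argsort(np.abs(pre))[-top_k:]   # neurons that matter for this particular input
        h = np.maximum(pre[idx], 0.0)            # ReLU only on the selected neurons
        return W2[:, idx] @ h                    # project back using only those columns

    # Toy usage: a 4096-wide feed-forward block where only 256 neurons fire for this input.
    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(4096, 1024)), rng.normal(size=(1024, 4096))
    print(contextually_sparse_ffn(rng.normal(size=1024), W1, W2, top_k=256).shape)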

Attention scheme inspired softmax regression

Y Deng, Z Li, Z Song - arXiv preprint arXiv:2304.10411, 2023 - arxiv.org
Large language models (LLMs) have made transformative changes to human society. One of
the key computations in LLMs is the softmax unit. This operation is important in LLMs …
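
For reference, the softmax unit maps $z \in \mathbb{R}^n$ to $\mathrm{softmax}(z)_i = \exp(z_i) / \sum_{j=1}^{n} \exp(z_j)$; the softmax regression problem studied in this line of work is, roughly, $\min_{x \in \mathbb{R}^d} \| \langle \exp(Ax), \mathbf{1}_n \rangle^{-1} \exp(Ax) - b \|_2$ for given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$ (the exact formulation is as stated in the paper).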

A Nearly-Optimal Bound for Fast Regression with Guarantee

Z Song, M Ye, J Yin, L Zhang - International Conference on …, 2023 - proceedings.mlr.press
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, we
consider the regression problem with $\ell_\infty$ guarantees: finding a vector …
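
Read loosely, the $\ell_\infty$ guarantee asks for an approximate solution $x'$ that is close coordinate-wise to the least-squares minimizer $x^* = \arg\min_{x} \|Ax - b\|_2$, i.e., $\|x' - x^*\|_\infty \le \epsilon$ up to the problem-dependent scaling stated in the paper, which is a stronger per-coordinate requirement than the usual $\ell_2$ error bound.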

Training multi-layer over-parametrized neural network in subquadratic time

Z Song, L Zhang, R Zhang - arXiv preprint arXiv:2112.07628, 2021 - arxiv.org
We consider the problem of training a multi-layer over-parametrized neural network to
minimize the empirical risk induced by a loss function. In the typical setting of over …
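
Here the empirical risk is the usual average loss over the training set, $\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_{\theta}(x_i), y_i\big)$, and the over-parametrized setting means the network width is large relative to the number of training points $n$.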

Sketching for first order method: efficient algorithm for low-bandwidth channel and vulnerability

Z Song, Y Wang, Z Yu, L Zhang - … Conference on Machine …, 2023 - proceedings.mlr.press
Sketching is one of the most fundamental tools in large-scale machine learning. It enables
runtime and memory savings via randomly compressing the original large problem into lower …
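
A minimal numerical sketch of the compression idea (assumed Gaussian sketch; the paper's scheme and analysis may differ): project a high-dimensional vector, e.g. a gradient, down to a short message, then de-sketch it on the receiving end into an unbiased but noisy estimate.

    import numpy as np

    d, m = 10_000, 200                                   # original vs. sketched dimension
    rng = np.random.default_rng(0)
    S = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))  # random Gaussian sketching matrix

    g = rng.normal(size=d)                               # e.g. a gradient to be communicated
    sk = S @ g                                           # low-bandwidth message of length m
    g_est = S.T @ sk                                     # de-sketch: E[S^T S] = I, so unbiased estimate of g
    print(np.linalg.norm(sk) / np.linalg.norm(g))        # ~1: norms are approximately preserved
    print(np.dot(g_est, g) / np.dot(g, g))               # ~1: alignment with the true vector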

Multi-layer transformers gradient can be approximated in almost linear time

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2408.13233, 2024 - arxiv.org
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
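
Concretely, for sequence length $n$ and head dimension $d$, exact self-attention $\mathrm{softmax}(QK^\top/\sqrt{d})V$ costs $\Theta(n^2 d)$ time, which is what makes an almost linear time approximation of the gradient significant for long sequences.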

GradientCoin: A peer-to-peer decentralized large language models

Y Gao, Z Song, J Yin - arXiv preprint arXiv:2308.10502, 2023 - arxiv.org
Since its proposal as an electronic cash system in 2008, Bitcoin has fundamentally
changed the economic system over the last decade. Since 2022, large language models …

HSR-enhanced sparse attention acceleration

B Chen, Y Liang, Z Sha, Z Shi, Z Song - arXiv preprint arXiv:2410.10165, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
applications, but their performance on long-context tasks is often limited by the …

InfoPrompt: Information-theoretic soft prompt tuning for natural language understanding

J Wu, T Yu, R Wang, Z Song, R Zhang… - Advances in …, 2024 - proceedings.neurips.cc
Soft prompt tuning achieves superior performance across a wide range of few-shot tasks.
However, the performance of prompt tuning can be highly sensitive to the initialization of …
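
A minimal soft-prompt sketch under assumed shapes (illustrative only; it does not include InfoPrompt's information-theoretic objective): learnable prompt embeddings are prepended to the frozen model's input embeddings, and only the prompt parameters are trained, which is why their initialization matters.

    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        def __init__(self, prompt_len, hidden_dim):
            super().__init__()
            # Learnable prompt vectors; this initialization is the sensitive part noted above.
            self.prompt = nn.Parameter(torch.randn(prompt_len, hidden_dim) * 0.02)

        def forward(self, token_embeds):                     # token_embeds: (batch, seq, hidden)
            batch = token_embeds.size(0)
            prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
            return torch.cat([prompt, token_embeds], dim=1)  # prepend the prompt to the sequence

    # Usage: prepend a 10-token soft prompt to a batch of input embeddings.
    embeds = torch.randn(2, 16, 768)
    print(SoftPrompt(10, 768)(embeds).shape)                 # torch.Size([2, 26, 768])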