nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training

Z Lin, Y Miao, Q Zhang, F Yang, Y Zhu, C Li… - … USENIX Symposium on …, 2024 - usenix.org
With the growing model size of deep neural networks (DNNs), deep learning training
increasingly relies on handcrafted search spaces to find efficient parallelization execution …

Tensor attention training: Provably efficient learning of higher-order transformers

J Gu, Y Liang, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2405.16411, 2024 - arxiv.org
Tensor Attention, a multi-view attention that is able to capture high-order correlations among
multiple modalities, can overcome the representational limitations of classical matrix …

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

H Wang, S Ma, R Wang, F Wei - arXiv preprint arXiv:2407.10969, 2024 - arxiv.org
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large
language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can …

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

S Yang, B Wang, Y Zhang, Y Shen, Y Kim - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers with linear attention (i.e., linear transformers) and state-space models have
recently been suggested as a viable linear-time alternative to transformers with softmax …

Memory³: Language Modeling with Explicit Memory

H Yang, Z Lin, W Wang, H Wu, Z Li, B Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
The training and inference of large language models (LLMs) together form a costly process
that transports knowledge from raw data to meaningful computation. Inspired by the memory …

FocusLLM: Scaling LLM's Context by Parallel Decoding

Z Li, Y Zhang, T Pan, Y Sun, Z Duan, J Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Empowering LLMs with the ability to utilize useful information from a long context is crucial
for many downstream applications. However, achieving long context lengths with the …