nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training

Z Lin, Y Miao, Q Zhang, F Yang, Y Zhu, C Li… - … USENIX Symposium on …, 2024 - usenix.org
With the growing model size of deep neural networks (DNNs), deep learning training
increasingly relies on handcrafted search spaces to find efficient parallelization execution …

Tensor attention training: Provably efficient learning of higher-order transformers

J Gu, Y Liang, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2405.16411, 2024 - arxiv.org
Tensor Attention, a multi-view attention that is able to capture high-order correlations among
multiple modalities, can overcome the representational limitations of classical matrix …

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

H Wang, S Ma, R Wang, F Wei - arXiv preprint arXiv:2407.10969, 2024 - arxiv.org
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large
language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can …

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

S Yang, B Wang, Y Zhang, Y Shen, Y Kim - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers with linear attention (i.e., linear transformers) and state-space models have
recently been suggested as a viable linear-time alternative to transformers with softmax …

Memory³: Language Modeling with Explicit Memory

H Yang, Z Lin, W Wang, H Wu, Z Li, B Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
The training and inference of large language models (LLMs) together form a costly process
that transports knowledge from raw data to meaningful computation. Inspired by the memory …

FocusLLM: Scaling LLM's Context by Parallel Decoding

Z Li, Y Zhang, T Pan, Y Sun, Z Duan, J Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Empowering LLMs with the ability to utilize useful information from a long context is crucial
for many downstream applications. However, achieving long context lengths with the …