LLM-based edge intelligence: A comprehensive survey on architectures, applications, security and trustworthiness

O Friha, MA Ferrag, B Kantarci… - IEEE Open Journal …, 2024 - ieeexplore.ieee.org
The integration of Large Language Models (LLMs) and Edge Intelligence (EI) introduces a
groundbreaking paradigm for intelligent edge devices. With their capacity for human-like …

MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …

Mobile edge intelligence for large language models: A contemporary survey

G Qu, Q Chen, W Wei, Z Lin, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs), referring to running LLMs on edge devices, have
attracted considerable interest owing to their superior privacy, reduced latency, and bandwidth …

KVQuant: Towards 10 million context length LLM inference with KV cache quantization

C Hooper, S Kim, H Mohammadzadeh… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs are seeing growing use for applications such as document analysis and
summarization, which require large context windows, and with these large context windows …
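
As a rough sketch of the general idea behind KV cache quantization (illustrative only, not the non-uniform, per-channel scheme KVQuant actually proposes; the function and variable names below are invented), keys can be quantized per channel and values per token to low-bit integers before being stored in the cache:

import numpy as np

def quantize_along(x, bits=4, axis=0):
    # Uniform min-max quantization of x along `axis` (illustration only).
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum((hi - lo) / (2**bits - 1), 1e-8)
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Toy single-head KV cache of shape (seq_len, head_dim).
keys = np.random.randn(1024, 128).astype(np.float32)
values = np.random.randn(1024, 128).astype(np.float32)

# Keys: statistics taken over tokens (per-channel quantization).
qk, k_scale, k_lo = quantize_along(keys, bits=4, axis=0)
# Values: statistics taken over channels (per-token quantization).
qv, v_scale, v_lo = quantize_along(values, bits=4, axis=1)

print("mean |key error|:", np.abs(dequantize(qk, k_scale, k_lo) - keys).mean())

At 4 bits this cuts KV cache memory roughly 4x relative to fp16 storage, which is the kind of saving that makes very long context windows feasible.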

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …

RetrievalAttention: Accelerating long-context LLM inference via vector retrieval

D Liu, M Chen, B Lu, H Jiang, Z Han, Q Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based Large Language Models (LLMs) have become increasingly important in
various domains. However, the quadratic time complexity of the attention operation poses a …
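
To make the idea concrete, retrieval-based attention restricts each query to the keys most similar to it; RetrievalAttention builds approximate nearest-neighbor indexes for this, whereas the sketch below substitutes a brute-force top-k search purely for illustration (every name in it is made up):

import numpy as np

def retrieval_restricted_attention(q, keys, values, top_k=32):
    # Attend only over the top_k keys most similar to the query.
    # A brute-force search stands in for the ANN index a real system would use.
    scores = keys @ q                                  # (seq_len,)
    idx = np.argpartition(-scores, top_k)[:top_k]      # indices of top_k scores
    sel = scores[idx] / np.sqrt(q.shape[0])            # scaled selected scores
    w = np.exp(sel - sel.max())
    w /= w.sum()
    return w @ values[idx]                             # (head_dim,)

keys = np.random.randn(8192, 128).astype(np.float32)
values = np.random.randn(8192, 128).astype(np.float32)
q = np.random.randn(128).astype(np.float32)
print(retrieval_restricted_attention(q, keys, values).shape)   # (128,)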

Post-Training Sparse Attention with Double Sparsity

S Yang, Y Sheng, JE Gonzalez, I Stoica… - arXiv preprint arXiv …, 2024 - arxiv.org
The inference process for large language models is slow and memory-intensive, with one of
the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper …
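
A hedged illustration of the token-plus-channel sparsity idea: approximate the attention scores with only a handful of influential channels, then compute exact attention over the tokens those scores select. The channel choice and all names below are placeholders, not the calibration procedure from the paper:

import numpy as np

def double_sparse_attention(q, keys, values, heavy_channels, top_k=64):
    # Cheap score estimate from a few "heavy" channels...
    approx = keys[:, heavy_channels] @ q[heavy_channels]
    idx = np.argpartition(-approx, top_k)[:top_k]
    # ...then exact attention restricted to the selected tokens.
    exact = keys[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(exact - exact.max())
    w /= w.sum()
    return w @ values[idx]

keys = np.random.randn(4096, 128).astype(np.float32)
values = np.random.randn(4096, 128).astype(np.float32)
q = np.random.randn(128).astype(np.float32)
heavy = np.arange(16)   # stand-in for channels found by offline calibration
print(double_sparse_attention(q, keys, values, heavy).shape)   # (128,)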

UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation

Z Li, J Xiong, F Ye, C Zheng, X Wu, J Lu, Z Wan… - arXiv preprint arXiv …, 2024 - arxiv.org
We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented
Generation (RAG) that utilizes Signal-to-Noise Ratio (SNR)-based span uncertainty to …

LoCoCo: Dropping In Convolutions for Long Context Compression

R Cai, Y Tian, Z Wang, B Chen - arXiv preprint arXiv:2406.05317, 2024 - arxiv.org
This paper tackles the memory hurdle of processing long context sequences in Large
Language Models (LLMs) by presenting a novel approach, Dropping In Convolutions for …

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Q Zhu, J Duan, C Chen, S Liu, X Li, G Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) now support extremely long context windows, but the
quadratic complexity of vanilla attention results in significantly long Time-to-First-Token …
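
As a generic picture of structured sparse attention (not SampleAttention's adaptive, per-head pattern selection; the window size and column choice below are arbitrary), a mask can keep only a local sliding window plus a few always-attended "column" tokens:

import numpy as np

def structured_sparse_mask(seq_len, window=128, column_tokens=(0, 1, 2, 3)):
    # Causal mask keeping a recent-token window plus a few global columns.
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    local = (i - j) < window
    cols = np.isin(j, np.asarray(column_tokens))
    return causal & (local | cols)

mask = structured_sparse_mask(1024)
print("attended fraction:", mask.mean())   # far below the dense causal ~0.5

Only the kept positions need to be computed, which is what shortens Time-to-First-Token during the prefill stage.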