A Review on Edge Large Language Models: Design, Execution, and Applications

Y Zheng, Y Chen, B Qian, X Shi, Y Shu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have revolutionized natural language processing with their
exceptional capabilities. However, deploying LLMs on resource-constrained edge devices …

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

J Wei, S Cao, T Cao, L Ma, L Wang, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The deployment of Large Language Models (LLMs) on edge devices is increasingly
important to enhance on-device intelligence. Weight quantization is crucial for reducing the …
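The snippet above cuts off before describing the mechanism, so here is a minimal, hypothetical sketch of the table-lookup idea behind kernels like T-MAC (illustrative only; the group size, 2-bit codebook, and function names below are assumptions, not the paper's actual kernel): for a group of g low-bit weights that share one activation slice, the dot products of that slice with every possible weight pattern are precomputed once, turning the matrix-vector product into table lookups instead of multiplications.

```python
# Hypothetical sketch of table-lookup matmul for 2-bit weights
# (not T-MAC's actual kernel; codebook and group size are assumed).
# For each group of G weights sharing an activation slice, all
# 4**G possible weight patterns are enumerated and their dot
# products with the slice precomputed; the matmul then needs one
# table lookup per group instead of G multiplications.

G = 4                      # weights per lookup group (assumed)
LEVELS = [-2, -1, 1, 2]    # assumed 2-bit weight codebook

def build_table(act_slice):
    """Dot product of act_slice with every possible 2-bit weight pattern."""
    table = []
    for code in range(4 ** G):
        s, c = 0.0, code
        for j in range(G):
            s += LEVELS[c & 3] * act_slice[j]  # decode 2 bits at a time
            c >>= 2
        table.append(s)
    return table

def lut_matvec(codes, acts):
    """codes: per-row lists of packed group codes; acts: activation vector."""
    n_groups = len(acts) // G
    tables = [build_table(acts[g * G:(g + 1) * G]) for g in range(n_groups)]
    return [sum(tables[g][row[g]] for g in range(n_groups)) for row in codes]

# Example: one row whose single group encodes weights [1, -1, 2, -2]
# (level indices 2, 1, 3, 0 packed as 2 + 1*4 + 3*16 + 0*64 = 54)
```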

AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments

C Fang, S Liu, Z Zhou, B Guo, J Tang, K Ma… - Proceedings of the 22nd …, 2024 - dl.acm.org
On-device adaptation to continual, unpredictable domain shifts is essential for mobile
applications like autonomous driving and augmented reality to deliver seamless user …

GPTVQ: The Blessing of Dimensionality for LLM Quantization

M van Baalen, A Kuzmin, M Nagel, P Couperus… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work we show that the size versus accuracy trade-off of neural network quantization
can be significantly improved by increasing the quantization dimensionality. We propose the …
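As a loose illustration of why raising the quantization dimensionality can help (a toy sketch, not the GPTVQ algorithm; the codebook and helper names are assumptions): vector quantization maps pairs of weights jointly to the nearest entry of a shared 2-D codebook, which can capture correlations between weights that independent per-weight rounding cannot.

```python
# Toy contrast of scalar vs. 2-D vector quantization
# (illustrative only; not the GPTVQ method or its codebooks).

def scalar_quant(w, step=1.0):
    """Quantize each weight independently to a uniform grid."""
    return [round(x / step) * step for x in w]

def vector_quant(w, codebook):
    """Quantize consecutive weight pairs to the nearest 2-D codebook entry."""
    out = []
    for i in range(0, len(w), 2):
        v = (w[i], w[i + 1])
        best = min(codebook,
                   key=lambda c: (c[0] - v[0]) ** 2 + (c[1] - v[1]) ** 2)
        out.extend(best)
    return out
```

With a codebook tuned to the weight distribution, the joint assignment can land closer to the original pairs than the per-axis grid does.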

Anatomizing Deep Learning Inference in Web Browsers

Q Wang, S Jiang, Z Chen, X Cao, Y Li, A Li… - ACM Transactions on …, 2024 - dl.acm.org
Web applications have increasingly adopted Deep Learning (DL) through in-browser
inference, wherein DL inference is performed directly within Web browsers. The actual …

Multiplication-Free Lookup-Based CNN Accelerator using Residual Vector Quantization and Its FPGA Implementation

H Fuketa, T Katashita, Y Hori, M Hioki - IEEE Access, 2024 - ieeexplore.ieee.org
In this paper, a table lookup-based computing technique is proposed to perform
convolutional neural network (CNN) inference without multiplication, and its FPGA …
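The core primitive named in the title, residual vector quantization, can be sketched in a few lines (a toy sketch under assumed two-stage codebooks; not the paper's FPGA accelerator): a vector is first quantized against a coarse codebook, and the remaining residual is then quantized against a second codebook, so the reconstruction is the sum of the two chosen entries.

```python
# Toy residual vector quantization (illustrative; codebooks assumed,
# not taken from the paper's design).

def nearest(v, codebook):
    """Return the codebook entry closest to v in squared Euclidean distance."""
    return min(codebook,
               key=lambda c: sum((ci - vi) ** 2 for ci, vi in zip(c, v)))

def residual_vq(v, cb1, cb2):
    """Two-stage quantization: coarse entry plus quantized residual."""
    c1 = nearest(v, cb1)
    residual = [vi - ci for vi, ci in zip(v, c1)]
    c2 = nearest(residual, cb2)
    return [a + b for a, b in zip(c1, c2)]  # reconstruction
```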

Turbocharge Speech Understanding with Pilot Inference

R Wang, FX Lin - Proceedings of the 30th Annual International …, 2024 - dl.acm.org
Modern speech understanding (SU) runs a sophisticated pipeline: ingesting streaming voice
input, it repeatedly executes encoder-decoder-based deep neural networks; by …

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization

C Li, Z Zhou, Y Wang, F Yang, T Cao, M Yang… - Proceedings of the 29th …, 2024 - dl.acm.org
DRAM-based processing-in-memory (DRAM-PIM) has gained commercial prominence in
recent years. However, its integration for deep learning acceleration poses inherent …

LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator

G Li, S Ye, C Chen, Y Wang, F Yang, T Cao… - arXiv preprint arXiv …, 2025 - arxiv.org
The emergence of neural network capabilities invariably leads to a significant surge in
computational demands due to expanding model sizes and increased computational …

InMemQK: A Product Quantization Based MatMul Module for Compute-in-Memory Attention Macro

P Feng, Y Chen, J Yu, H Yue, Z Jiang, Y Xiao, W Xiao… - Applied Sciences, 2024 - mdpi.com
Large Language Models (LLMs), based on transformer architecture, have demonstrated
remarkable capabilities in natural language processing tasks, enabling machines to …
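Since this snippet also cuts off before the mechanism, here is a generic product-quantization dot-product sketch (an illustration of the general PQ technique the title names; the codebooks and function names are assumptions, not the InMemQK macro): each key vector is split into sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and at query time the dot product of the query with every centroid is precomputed per sub-space, so each key's score becomes a sum of table lookups.

```python
# Generic product-quantization dot product (illustrative; not the
# InMemQK compute-in-memory design).

def pq_encode(vec, codebooks):
    """Replace each sub-vector of vec with its nearest-centroid index."""
    d = len(vec) // len(codebooks)
    codes = []
    for s, cb in enumerate(codebooks):
        sub = vec[s * d:(s + 1) * d]
        codes.append(min(range(len(cb)),
                         key=lambda k: sum((cb[k][j] - sub[j]) ** 2
                                           for j in range(d))))
    return codes

def pq_dot(query, codes, codebooks):
    """Approximate query . key from the key's PQ codes via table lookups."""
    d = len(query) // len(codebooks)
    # per-subspace table of query-slice . centroid
    tables = [[sum(q * c for q, c in zip(query[s * d:(s + 1) * d], cent))
               for cent in cb]
              for s, cb in enumerate(codebooks)]
    return sum(tables[s][k] for s, k in enumerate(codes))
```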