A Survey on Trustworthy Edge Intelligence: From Security and Reliability To Transparency and Sustainability

X Wang, B Wang, Y Wu, Z Ning… - … Surveys & Tutorials, 2024 - ieeexplore.ieee.org
Edge Intelligence (EI) integrates Edge Computing (EC) and Artificial Intelligence (AI) to push
the capabilities of AI to the network edge for real-time, efficient and secure intelligent …

Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

L Wang, L Ma, S Cao, Q Zhang, J Xue, Y Shi… - … USENIX Symposium on …, 2024 - usenix.org
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …
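Ladder's actual transformation pipeline is not spelled out in this snippet; purely as an illustration of the kind of low-precision weight storage such systems target, here is a minimal numpy sketch of symmetric int4 quantization (the function names and the per-tensor scale are assumptions, not the paper's scheme):

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor quantization of fp32 weights to 4-bit integers."""
    scale = np.abs(w).max() / 7.0          # int4 symmetric range: [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int4(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```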

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

M Chen, W Shao, P Xu, J Wang, P Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are integral to modern natural language processing and
artificial intelligence. However, they face challenges in managing their significant memory …
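As a generic sketch of quantization-aware training (not EfficientQAT's specific recipe, which the snippet does not describe), the usual trick is to fake-quantize weights in the forward pass while letting gradients pass straight through the rounding:

```python
import torch

def fake_quant(w, bits=4):
    """Fake-quantize weights; the straight-through trick keeps gradients flowing."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()          # forward: w_q, backward: identity

w = torch.randn(16, 16, requires_grad=True)
x = torch.randn(8, 16)
loss = (x @ fake_quant(w)).pow(2).mean()
loss.backward()                            # gradients reach w despite rounding
print(w.grad.abs().mean())
```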

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

S Ma, C Fang, H Shao, Z Wang - arXiv preprint arXiv:2409.17870, 2024 - arxiv.org
Large language models (LLMs) have been widely applied but face challenges in efficient
inference. While quantization methods reduce computational demands, ultra-low bit …
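A common route to arbitrary-bit matmuls on fixed-width hardware is bit-plane decomposition: split integer weights into binary planes and sum binary matmuls weighted by powers of two. Whether this matches the paper's kernel design is an assumption; the sketch below just verifies the arithmetic identity in numpy:

```python
import numpy as np

def bitplane_matmul(x, w_int, bits=3):
    """Compute x @ w by decomposing unsigned int weights into binary bit planes."""
    acc = np.zeros((x.shape[0], w_int.shape[1]), dtype=np.int64)
    for b in range(bits):
        plane = (w_int >> b) & 1           # binary matrix for bit b
        acc += (x @ plane) << b            # weight each plane by 2**b
    return acc

x = np.random.randint(0, 8, size=(2, 4)).astype(np.int64)
w = np.random.randint(0, 8, size=(4, 3)).astype(np.int64)
assert np.array_equal(bitplane_matmul(x, w), x @ w)
```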

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

X Luo, Y Wang, Q Zhu, Z Zhang, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid growth in the parameters of large language models (LLMs) has made inference
latency a fundamental bottleneck, limiting broader application of LLMs. Speculative …
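The snippet breaks off at "Speculative …"; for context, here is a toy sketch of the general draft-and-verify loop behind speculative decoding (not the paper's token-recycling variant; real implementations also verify all drafts in one batched forward pass rather than per token):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """Draft k tokens cheaply, then keep the longest prefix the target accepts."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)                # cheap draft model proposes a token
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in draft:
        if target_next(ctx) != t:          # target model verifies each proposal
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Toy models: both alternate 0/1, so every drafted token is accepted.
next_tok = lambda ctx: 1 - ctx[-1]
print(speculative_step(next_tok, next_tok, [0]))   # -> [1, 0, 1, 0]
```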

Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment

Y Ji, C Fang, S Ma, H Shao, Z Wang - arXiv preprint arXiv:2407.12070, 2024 - arxiv.org
Transformer models have revolutionized AI tasks, but their large size hinders real-world
deployment on resource-constrained and latency-critical edge devices. While binarized …
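As a minimal sketch of the standard 1-bit weight scheme such co-designs build on (a sign matrix plus a per-row scale, in the XNOR-Net style; the paper's specific transformer/accelerator co-design is not shown here):

```python
import numpy as np

def binarize(w):
    """1-bit weights: sign matrix plus a per-output-row scaling factor."""
    alpha = np.abs(w).mean(axis=1, keepdims=True)   # per-row scale
    return np.sign(w), alpha

w = np.random.randn(4, 8)
b, alpha = binarize(w)
w_hat = alpha * b                                   # dequantized approximation
print("reconstruction error:", np.abs(w - w_hat).mean())
```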

FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation

L Ma, M Sun, Z Shen - arXiv preprint arXiv:2407.07093, 2024 - arxiv.org
This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for
the first time how to train a large-scale binary language model from scratch (not the partial …
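The abstract names autoregressive distillation as the training signal; a generic sketch of such a distillation loss (KL divergence between teacher and student next-token distributions; the temperature and reduction choices are assumptions, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=1.0):
    """KL divergence between teacher and student next-token distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

student = torch.randn(8, 32000)        # (tokens, vocab) logits from binary model
teacher = torch.randn(8, 32000)        # logits from full-precision teacher
print(distill_loss(student, teacher))
```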

VcLLM: Video Codecs are Secretly Tensor Codecs

C Xu, Y Wu, X Yang, B Chen, M Lentz, D Zhuo… - arXiv preprint arXiv …, 2024 - arxiv.org
As the parameter size of large language models (LLMs) continues to expand, the needs for a
large memory footprint and high communication bandwidth have become significant …
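The title's premise is that a tensor, once mapped onto 8-bit frames, can be handed to an off-the-shelf video codec. A minimal sketch of just that frame-packing step (the codec call itself is omitted, and the min-max normalization is an assumption):

```python
import numpy as np

def tensor_to_frames(t, frame_hw=(64, 64)):
    """Map a float tensor onto uint8 'video frames' a codec could compress."""
    lo, hi = t.min(), t.max()
    u8 = np.round((t - lo) / (hi - lo) * 255).astype(np.uint8)
    h, w = frame_hw
    pad = (-u8.size) % (h * w)
    u8 = np.pad(u8.ravel(), (0, pad))                 # pad to whole frames
    return u8.reshape(-1, h, w), (lo, hi)             # frames + range for decode

frames, rng = tensor_to_frames(np.random.randn(100, 50))
print(frames.shape)                                   # -> (2, 64, 64)
```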

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

H Wang, B Liu, H Shao, B Xiao, K Zeng, G Wan… - arXiv preprint arXiv …, 2024 - arxiv.org
Parameter quantization for Large Language Models (LLMs) has attracted increasing
attention recently for reducing memory costs and improving computational efficiency. Early …
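The early PTQ baselines the abstract alludes to typically use round-to-nearest with per-column scales; here is a sketch of that baseline (CLAQ's adaptive strategies go beyond this, so this is not the paper's algorithm):

```python
import numpy as np

def quantize_per_column(w, bits=3):
    """Round-to-nearest PTQ with an independent scale per weight column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                       # dequantized weights

w = np.random.randn(128, 64)
print("mse:", np.mean((w - quantize_per_column(w)) ** 2))
```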

Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

D Jo, T Kim, Y Kim, JJ Kim - arXiv preprint arXiv:2406.12311, 2024 - arxiv.org
Binarization, which converts weight parameters to binary values, has emerged as an
effective strategy to reduce the size of large language models (LLMs). However, typical …
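A rough sketch of binary weights combined with a per-token mixture of scaling factors (the softmax gating below is a stand-in to illustrate the idea, not the paper's exact design):

```python
import numpy as np

def mos_linear(x, b, scales, gate_w):
    """Binary weights with a per-token softmax mix of several scale 'experts'."""
    g = x @ gate_w                          # (tokens, n_experts) gating logits
    g = np.exp(g) / np.exp(g).sum(-1, keepdims=True)   # softmax over experts
    alpha = g @ scales                      # per-token effective scale (tokens, d_out)
    return (x @ b.T) * alpha                # scale the binary matmul per token

d_in, d_out, n_exp, n_tok = 8, 4, 2, 3
b = np.sign(np.random.randn(d_out, d_in))        # frozen 1-bit weight matrix
scales = np.abs(np.random.randn(n_exp, d_out))   # one scale vector per expert
x = np.random.randn(n_tok, d_in)
print(mos_linear(x, b, scales, np.random.randn(d_in, n_exp)).shape)  # (3, 4)
```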