OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization

C Guo, J Tang, W Hu, J Leng, C Zhang… - Proceedings of the 50th …, 2023 - dl.acm.org
Transformer-based large language models (LLMs) have achieved great success with the
growing model size. LLMs' size grows by 240× every two years, which outpaces the …
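
The title names the core trick. As a loose, hypothetical sketch of what "outlier-victim pair" quantization can mean (illustrative only, not the paper's actual encoding or hardware datapath): an outlier keeps extra precision by pruning an adjacent "victim" value to zero, so the pair still fits the aligned low-bit memory budget.

import numpy as np

def ovp_quantize(x, threshold=4.0, bits=4):
    # Hypothetical parameters: values beyond `threshold` count as outliers.
    qmax = 2 ** (bits - 1) - 1
    scale = threshold / qmax
    out = np.clip(np.round(x / scale), -qmax, qmax) * scale  # plain low-bit quantization
    for i in np.flatnonzero(np.abs(x) > threshold):
        out[i] = x[i]                               # outlier kept at higher precision
        victim = i + 1 if i % 2 == 0 else i - 1     # its paired neighbor
        if 0 <= victim < len(x):
            out[victim] = 0.0                       # victim pruned to free its bits
    return out

print(ovp_quantize(np.array([0.3, -1.2, 9.5, 0.8])))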

RPTQ: Reorder-based post-training quantization for large language models

Z Yuan, L Niu, J Liu, W Liu, X Wang, Y Shang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale language models (LLMs) have demonstrated impressive performance, but their
deployment presents challenges due to their significant memory usage. This issue can be …

GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

C Guo, R Zhang, J Xu, J Leng, Z Liu, Z Huang… - Proceedings of the 29th …, 2024 - dl.acm.org
Large-scale deep neural networks (DNNs), such as large language models (LLMs), have
revolutionized the artificial intelligence (AI) field and become increasingly popular. However …

JUNO: Optimizing High-Dimensional Approximate Nearest Neighbour Search with Sparsity-Aware Algorithm and Ray-Tracing Core Mapping

Z Liu, W Ni, J Leng, Y Feng, C Guo, Q Chen… - Proceedings of the 29th …, 2024 - dl.acm.org
Approximate nearest neighbor (ANN) search is a widely applied technique in modern
intelligent applications, such as recommendation systems and vector databases. Therefore …

MSD: Mixing Signed Digit Representations for Hardware-efficient DNN Acceleration on FPGA with Heterogeneous Resources

J Wu, J Zhou, Y Gao, Y Ding, N Wong… - 2023 IEEE 31st …, 2023 - ieeexplore.ieee.org
By quantizing weights with different precision for different parts of a network, mixed-precision
quantization promises to reduce the hardware cost and improve the speed of deep neural …
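
The snippet describes mixed-precision quantization in general. A minimal NumPy sketch of the per-layer version (uniform symmetric quantizers with hypothetical bit-width assignments; the paper's signed-digit representations and FPGA mapping are not modeled here):

import numpy as np

def quantize(w, bits):
    # Uniform symmetric quantizer: scale to the largest magnitude.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

# Hypothetical assignment: sensitivity decides each layer's precision.
layer_bits = {"conv1": 8, "conv2": 4, "fc": 2}
weights = {name: np.random.randn(64, 64) for name in layer_bits}
quantized = {name: quantize(w, layer_bits[name]) for name, w in weights.items()}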

Dybit: Dynamic bit-precision numbers for efficient quantized neural network inference

J Zhou, J Wu, Y Gao, Y Ding, C Tao, B Li… - … on Computer-Aided …, 2023 - ieeexplore.ieee.org
To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth
numbers is actively researched. A prominent challenge is to quantize the DNN models into …

Approximate computing survey, Part II: Application-specific & architectural approximation techniques and applications

V Leon, MA Hanif, G Armeniakos, X Jiao… - arXiv preprint arXiv …, 2023 - arxiv.org
The challenging deployment of compute-intensive applications from domains such as Artificial
Intelligence (AI) and Digital Signal Processing (DSP) forces the community of computing …

LUT-NN: Empower efficient neural network inference with centroid learning and table lookup

X Tang, Y Wang, T Cao, LL Zhang, Q Chen… - Proceedings of the 29th …, 2023 - dl.acm.org
On-device Deep Neural Network (DNN) inference consumes significant computing
resources and development efforts. To alleviate that, we propose LUT-NN, the first system to …
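
The title names the mechanism: replace dot products with lookups into a precomputed centroid-weight table. A toy product-quantization-style sketch (the centroid learning that LUT-NN contributes is not shown; all names and shapes here are illustrative):

import numpy as np

D, K, SUB = 16, 8, 4              # input dim, centroids per subspace, sub-vector size
rng = np.random.default_rng(0)
w = rng.standard_normal(D)        # weights of one output neuron
centroids = rng.standard_normal((D // SUB, K, SUB))   # stand-in codebook

# Offline: precompute centroid . weight-slice for every (subspace, centroid).
table = np.einsum("skd,sd->sk", centroids, w.reshape(D // SUB, SUB))

def lut_forward(x):
    # Online: snap each input sub-vector to its nearest centroid, then
    # replace the dot product with table lookups and additions.
    xs = x.reshape(D // SUB, SUB)
    idx = np.linalg.norm(centroids - xs[:, None, :], axis=-1).argmin(axis=1)
    return table[np.arange(D // SUB), idx].sum()

x = rng.standard_normal(D)
print(lut_forward(x), "approximates", x @ w)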

Nesting forward automatic differentiation for memory-efficient deep neural network training

C Guo, Y Qiu, J Leng, C Zhang, Y Cao… - 2022 IEEE 40th …, 2022 - ieeexplore.ieee.org
An activation function is an element-wise mathematical function and plays a crucial role in
deep neural networks (DNNs). Many novel and sophisticated activation functions have been …
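
The entry concerns forward-mode automatic differentiation applied to activation functions. A minimal dual-number sketch of forward AD, where the tangent travels alongside the value so no separate derivative buffer is needed (illustrative only; the paper nests this inside backpropagation):

import math

class Dual:
    """Dual number: carries a value and a tangent; arithmetic applies the chain rule."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__
    def __truediv__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val,
                    (self.tan * o.val - self.val * o.tan) / (o.val * o.val))
    def __rtruediv__(self, o):
        return Dual(o) / self

def dexp(x):
    e = math.exp(x.val)
    return Dual(e, e * x.tan)

def silu(x):  # x * sigmoid(x), written entirely in dual arithmetic
    return x * (1.0 / (1.0 + dexp(Dual(-x.val, -x.tan))))

d = silu(Dual(0.5, 1.0))   # seed tangent 1.0
print(d.val, d.tan)        # activation value and its exact derivative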

Accelerating sparse DNNs based on tiled GEMM

C Guo, F Xue, J Leng, Y Qiu, Y Guan… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
Network pruning can reduce the computation cost of deep neural network (DNN) models.
However, sparse models often produce randomly-distributed weights to maintain accuracy …
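
A toy sketch of the tile-level idea in the title: store only the nonzero tiles of the weight matrix and run dense math on those, skipping zero tiles entirely (a NumPy stand-in, not the paper's GPU tensor-core implementation):

import numpy as np

T = 4                                      # tile size (e.g. one tensor-core tile)

def to_tiles(w):
    # Keep only tiles that contain at least one nonzero weight.
    return {(i, j): w[i:i+T, j:j+T]
            for i in range(0, w.shape[0], T)
            for j in range(0, w.shape[1], T)
            if np.any(w[i:i+T, j:j+T])}

def tiled_spmm(tiles, x, m):
    y = np.zeros((m, x.shape[1]))
    for (i, j), blk in tiles.items():      # dense GEMM only on the kept tiles
        y[i:i+T] += blk @ x[j:j+T]
    return y

w = np.random.randn(16, 16) * (np.random.rand(16, 16) > 0.9)   # ~10% dense
x = np.random.randn(16, 3)
print(np.allclose(tiled_spmm(to_tiles(w), x, 16), w @ x))      # True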