LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …
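For context on the roofline model named in this title: it bounds attainable throughput by the lesser of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch of that standard bound follows; the hardware figures (`peak_flops`, `mem_bw`) are illustrative assumptions, not numbers taken from the survey.

```python
# Minimal sketch of the standard roofline bound as commonly applied to LLM
# inference analysis. Hardware numbers are assumed for illustration only.

def roofline_bound(arithmetic_intensity: float,
                   peak_flops: float = 312e12,      # assumed peak compute, FLOP/s
                   mem_bw: float = 2.0e12) -> float:  # assumed memory bandwidth, B/s
    """Attainable FLOP/s = min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bw * arithmetic_intensity)

# Decode-phase GEMVs in LLM inference have low arithmetic intensity (a few
# FLOP/byte) and land in the memory-bound region; large prefill GEMMs sit past
# the ridge point in the compute-bound region.
print(roofline_bound(2.0))    # memory-bound: limited by bandwidth
print(roofline_bound(300.0))  # compute-bound: limited by peak FLOP/s
```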

Large language model inference acceleration: A comprehensive hardware perspective

J Li, J Xu, S Huang, Y Chen, W Li, J Liu, Y Lian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …

Resource-efficient Algorithms and Systems of Foundation Models: A Survey

M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2025 - dl.acm.org
Large foundation models, including large language models, vision transformers, diffusion models,
and large language model-based multimodal models, are revolutionizing the entire machine …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

New solutions on LLM acceleration, optimization, and application

Y Huang, LJ Wan, H Ye, M Jha, J Wang, Y Li… - Proceedings of the 61st …, 2024 - dl.acm.org
Large Language Models (LLMs) have revolutionized a wide range of applications with their
strong human-like understanding and creativity. Due to the continuously growing model size …

LlamaF: An efficient Llama2 architecture accelerator on embedded FPGAs

H Xu, Y Li, S Ji - 2024 IEEE 10th World Forum on Internet of …, 2024 - ieeexplore.ieee.org
Large language models (LLMs) have demonstrated remarkable abilities in natural language
processing. However, their deployment on resource-constrained embedded devices …

Efficient training and inference: Techniques for large language models using Llama

SR Cunningham, D Archambault, A Kung - Authorea Preprints, 2024 - techrxiv.org
Enhancing the efficiency of language models involves optimizing their training and
inference processes to reduce computational demands while maintaining high performance …

EdgeLLM: A highly efficient CPU-FPGA heterogeneous edge accelerator for large language models

M Huang, A Shen, K Li, H Peng, B Li, H Yu - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancements in artificial intelligence (AI), particularly Large Language
Models (LLMs), have profoundly affected our daily work and forms of communication …

A survey of small language models

C Van Nguyen, X Shen, R Aponte, Y Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Small Language Models (SLMs) have become increasingly important due to their efficiency
and ability to perform various language tasks with minimal computational resources …

FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs

E Kabir, MA Kabir, ARJ Downey, JD Bakos… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer neural networks (TNNs) are being applied across a widening range of
application domains, including natural language processing (NLP), machine translation, and …
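For context on the operation FAMOUS targets, below is a minimal NumPy sketch of standard scaled dot-product attention, softmax(QK^T/sqrt(d))V. It does not reproduce the paper's FPGA mapping, tiling, or precision choices; shapes and names are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard attention: softmax(Q K^T / sqrt(d)) V, with (seq, d) inputs."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                        # (seq_q, seq_k) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ v                                   # (seq_q, d) output

# Illustrative sizes only; hardware accelerators tile these matrices on-chip.
q = np.random.randn(8, 64)
k = np.random.randn(8, 64)
v = np.random.randn(8, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # (8, 64)
```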