A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …

Scan and snap: Understanding training dynamics and token composition in 1-layer transformer

Y Tian, Y Wang, B Chen, SS Du - Advances in Neural …, 2023 - proceedings.neurips.cc
Transformer architecture has shown impressive performance in multiple research domains
and has become the backbone of many neural network models. However, there is limited …

YaRN: Efficient context window extension of large language models

B Peng, J Quesnelle, H Fan, E Shippole - arXiv preprint arXiv:2309.00071, 2023 - arxiv.org
Rotary Position Embeddings (RoPE) have been shown to effectively encode positional
information in transformer-based language models. However, these models fail to …
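
Since this snippet names Rotary Position Embeddings (RoPE) without showing the mechanism, here is a minimal NumPy sketch of the standard RoPE rotation that context-extension methods in this family then rescale; the function name and the base frequency of 10000 follow common convention and are assumptions, not details taken from the paper.

```python
# Minimal sketch of rotary position embeddings (RoPE): consecutive feature pairs
# of each query/key vector are rotated by a position-dependent angle, so the dot
# product of two rotated vectors depends only on their relative position.
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x: (seq_len, dim) float query or key vectors, dim even.
    positions: (seq_len,) integer token positions."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)      # theta_i = base^(-2i/dim)
    angles = positions[:, None] * freqs[None, :]        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                     # paired features
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x1 * cos - x2 * sin                  # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because attention scores between rotated queries and keys depend only on relative position, extension methods in this line of work adjust the angle schedule (e.g., by interpolating or rescaling the frequencies) rather than retraining positional information from scratch.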

LM-Infinite: Simple on-the-fly length generalization for large language models

C Han, Q Wang, W Xiong, Y Chen, H Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
In recent years, there have been remarkable advancements in the performance of
Transformer-based Large Language Models (LLMs) across various domains. As these LLMs …

Griffin: Mixing gated linear recurrences with local attention for efficient language models

S De, SL Smith, A Fernando, A Botev… - arXiv preprint arXiv …, 2024 - arxiv.org
Recurrent neural networks (RNNs) have fast inference and scale efficiently on long
sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with …
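
As a rough illustration of the "gated linear recurrences" named in the title, the sketch below shows the generic form such blocks build on: an elementwise, input-gated linear state update with no nonlinearity inside the recurrence. It is a simplified stand-in under assumed notation, not the exact recurrent block the paper proposes.

```python
# Generic gated linear recurrence: the hidden state is updated elementwise with a
# per-token, per-channel gate a[t] in (0, 1), keeping O(1) state per token.
import numpy as np

def gated_linear_recurrence(x: np.ndarray, a: np.ndarray) -> np.ndarray:
    """x, a: (seq_len, dim); a is the forget gate, x the gated input."""
    seq_len, dim = x.shape
    h = np.zeros(dim)
    out = np.empty_like(x, dtype=float)
    for t in range(seq_len):
        # Linear in h (no tanh over the recurrence), which is what allows the
        # whole sequence to be computed with a parallel scan at training time.
        h = a[t] * h + (1.0 - a[t]) * x[t]
        out[t] = h
    return out
```

The fixed-size state and linear update are what give this model family its fast inference and efficient scaling on long sequences, the property the abstract leads with.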

Leave no context behind: Efficient infinite context transformers with infini-attention

T Munkhdalai, M Faruqui, S Gopal - arXiv preprint arXiv:2404.07143, 2024 - arxiv.org
This work introduces an efficient method to scale Transformer-based Large Language
Models (LLMs) to infinitely long inputs with bounded memory and computation. A key …

Repeat after me: Transformers are better than state space models at copying

S Jelassi, D Brandfonbrener, SM Kakade… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers are the dominant architecture for sequence modeling, but there is growing
interest in models that use a fixed-size latent state that does not depend on the sequence …
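
The snippet is cut off, but the contrast it sets up (a latent state whose size does not grow with the sequence) suggests a simple capacity argument, sketched below as a back-of-the-envelope calculation; the bit counts and dimensions are illustrative assumptions, not figures from the paper.

```python
# Rough capacity argument: copying an arbitrary L-token string over a vocabulary
# of size V needs at least L * log2(V) bits, while a fixed-size state can carry
# only a bounded number of bits regardless of L.
import math

def copy_bits_required(seq_len: int, vocab_size: int) -> float:
    """Information needed to reproduce an arbitrary seq_len-token string."""
    return seq_len * math.log2(vocab_size)

def state_bits_available(state_dim: int, bits_per_unit: float = 16.0) -> float:
    """Crude upper bound on what a fixed-size state of state_dim units can hold."""
    return state_dim * bits_per_unit

# Example: copying 10,000 tokens from a 32k vocabulary needs ~150k bits,
# while a 4096-dim fp16 state holds at most ~65k bits.
print(copy_bits_required(10_000, 32_000))   # ~149,658 bits
print(state_bits_available(4096))           # 65,536 bits
```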