Simple and scalable strategies to continually pre-train large language models

A Ibrahim, B Thérien, K Gupta, ML Richter… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start
the process over again once new data becomes available. A much more efficient solution is …

Addressing loss of plasticity and catastrophic forgetting in continual learning

M Elsayed, AR Mahmood - arXiv preprint arXiv:2404.00781, 2024 - arxiv.org
Deep representation learning methods struggle with continual learning, suffering from both
catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful …

Continual learning under language shift

E Gogoulou, T Lesort, M Boman, J Nivre - International Conference on …, 2024 - Springer
The recent increase in data and model scale for language model pre-training has led to
huge training costs. In scenarios where new data become available over time, updating a …

Utility-based perturbed gradient descent: An optimizer for continual learning

M Elsayed, AR Mahmood - arXiv preprint arXiv:2302.03281, 2023 - arxiv.org
Modern representation learning methods often struggle to adapt quickly under
non-stationarity because they suffer from catastrophic forgetting and decaying plasticity. Such …

Knowledge accumulation in continually learned representations and the issue of feature forgetting

T Hess, E Verwimp, GM van de Ven… - arXiv preprint arXiv …, 2023 - arxiv.org
Continual learning research has shown that neural networks suffer from catastrophic
forgetting "at the output level", but it is debated whether this is also the case at the level of …

Demystifying Forgetting in Language Model Fine-Tuning with Statistical Analysis of Example Associations

X Jin, X Ren - arXiv preprint arXiv:2406.14026, 2024 - arxiv.org
Language models (LMs) are known to suffer from forgetting of previously learned examples
when fine-tuned, breaking stability of deployed LM systems. Despite efforts on mitigating …

The shifting landscape of data: learning to tame distributional shifts

A Ibrahim - 2024 - papyrus.bib.umontreal.ca
Machine learning (ML) models achieve remarkable performance on tasks they are trained
for. However, they are often sensitive to shifts in the data distribution, which may lead to …

Demystifying Language Model Forgetting with Low-Rank Example Associations

X Jin, X Ren - NeurIPS 2024 Workshop on Scalable Continual … - openreview.net
Large language models (LLMs) suffer from forgetting of upstream data when fine-tuned.
Despite efforts on mitigating forgetting, few have investigated whether, and how forgotten …