Just one byte (per gradient): A note on low-bandwidth decentralized language model finetuning using shared randomness

E Zelikman, Q Huang, P Liang, N Haber… - arXiv preprint arXiv …, 2023 - arxiv.org
Language model training in distributed settings is limited by the communication cost of
gradient exchanges. In this short note, we extend recent work from Malladi et al. (2023) …
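
The idea behind the title: with a MeZO-style zeroth-order estimator, a worker that shares a random seed with its peers only needs to transmit the scalar projected gradient, since the perturbation direction can be regenerated locally from the seed. Below is a minimal sketch of that mechanism, not the paper's exact protocol; the toy objective and all names are illustrative.

import numpy as np

def zo_projected_grad(loss_fn, theta, seed, eps=1e-3):
    # Two-point zeroth-order estimate of the directional derivative
    # along a perturbation z regenerated from a shared seed.
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    return (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)

def apply_remote_update(theta, scalar_grad, seed, lr=1e-2):
    # Any worker holding the same seed can reconstruct the full update
    # from just (seed, scalar_grad).
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    return theta - lr * scalar_grad * z

# Worker A estimates a gradient on its local data and sends only
# (seed, scalar_grad) -- a few bytes -- instead of a full gradient vector.
theta = np.zeros(10)
loss_fn = lambda p: float(np.sum((p - 1.0) ** 2))  # toy objective
seed = 42
g = zo_projected_grad(loss_fn, theta, seed)

# Worker B, holding the same theta and seed, applies the identical update.
theta = apply_remote_update(theta, g, seed)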

CO2: Efficient distributed training with full communication-computation overlap

W Sun, Z Qin, W Sun, S Li, D Li, X Shen, Y Qiao… - arXiv preprint arXiv …, 2024 - arxiv.org
The fundamental success of large language models hinges upon the efficacious
implementation of large-scale distributed training techniques. Nevertheless, building a vast …

Asynchronous Local-SGD Training for Language Modeling

B Liu, R Chhaparia, A Douillard, S Kale… - arXiv preprint arXiv …, 2024 - arxiv.org
Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is
an approach to distributed optimization where each device performs more than one SGD …
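
As the snippet says, each device takes several SGD steps from the same starting point before models are synchronized by averaging. A minimal synchronous sketch follows (the paper studies the asynchronous variant, which is more involved); the least-squares objective and all names are illustrative.

import numpy as np

def local_sgd_round(theta, worker_data, local_steps=8, lr=0.1):
    # One synchronous Local-SGD (federated averaging) round:
    # each worker takes several SGD steps locally, then the
    # resulting models are averaged in a single communication.
    local_models = []
    for X, y in worker_data:
        w = theta.copy()
        for _ in range(local_steps):
            grad = 2 * X.T @ (X @ w - y) / len(y)  # least-squares gradient
            w -= lr * grad
        local_models.append(w)
    return np.mean(local_models, axis=0)

# Toy setup: two workers with different local datasets.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(5)
worker_data = []
for _ in range(2):
    X = rng.standard_normal((32, 5))
    worker_data.append((X, X @ w_true + 0.01 * rng.standard_normal(32)))

theta = np.zeros(5)
for _ in range(20):
    theta = local_sgd_round(theta, worker_data)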

DiLoCo: Distributed low-communication training of language models

A Douillard, Q Feng, AA Rusu, R Chhaparia… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have become a critical component in many applications of
machine learning. However, standard approaches to training LLMs require a large number of …
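
For context, the communication pattern DiLoCo-style methods build on: workers optimize locally for many steps, the averaged parameter delta is treated as a pseudo-gradient, and an outer optimizer applies it to the shared model, so communication happens only once per outer round. A simplified sketch with plain momentum as the outer optimizer (the paper's specific inner and outer optimizers differ); all names are illustrative.

import numpy as np

def outer_round(theta, workers, inner_steps=50, inner_lr=0.05,
                outer_lr=0.7, momentum=None, beta=0.9):
    # One communication round of a local-update scheme with an outer
    # optimizer: each worker optimizes locally for many steps, the
    # averaged parameter delta is used as a pseudo-gradient, and
    # momentum SGD applies it to the shared model.
    if momentum is None:
        momentum = np.zeros_like(theta)
    deltas = []
    for X, y in workers:
        w = theta.copy()
        for _ in range(inner_steps):
            w -= inner_lr * 2 * X.T @ (X @ w - y) / len(y)
        deltas.append(theta - w)            # pseudo-gradient from this worker
    pseudo_grad = np.mean(deltas, axis=0)   # only deltas are communicated
    momentum = beta * momentum + pseudo_grad
    return theta - outer_lr * momentum, momentum

# Toy usage with a single worker on a least-squares problem.
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 5)); y = X @ rng.standard_normal(5)
theta, m = np.zeros(5), None
for _ in range(10):
    theta, m = outer_round(theta, [(X, y)], momentum=m)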

Federated full-parameter tuning of billion-sized language models with communication cost under 18 kilobytes

Z Qin, D Chen, B Qian, B Ding, Y Li, S Deng - arXiv preprint arXiv …, 2023 - arxiv.org
Pre-trained large language models (LLMs) require fine-tuning to improve their
responsiveness to natural language instructions. Federated learning (FL) offers a way to …

SLoRA: Federated parameter efficient fine-tuning of language models

S Babakniya, AR Elkordy, YH Ezzeldin, Q Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Transfer learning via fine-tuning pre-trained transformer models has gained significant
success in delivering state-of-the-art results across various NLP tasks. In the absence of …
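
SLoRA builds on LoRA-style parameter-efficient fine-tuning: a frozen pretrained weight matrix is augmented with a trainable low-rank update, so only a small adapter is trained and, in the federated setting, communicated. A minimal sketch of a LoRA-style linear layer in PyTorch, not the paper's federated procedure; shapes and names are illustrative.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # A frozen linear layer plus a trainable low-rank update:
    # y = W x + (alpha / r) * B A x, with only A and B trained.
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(768, 768, r=8)
trainable = [p for p in layer.parameters() if p.requires_grad]
# Only lora_A and lora_B (2 * 8 * 768 values) are trained and, in a
# federated setting, exchanged -- far smaller than the full 768x768 weight.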

Distributed inference and fine-tuning of large language models over the internet

A Borzunov, M Ryabinin… - Advances in …, 2024 - proceedings.neurips.cc
Large language models (LLMs) are useful in many NLP tasks and become more capable
with size, with the best open-source models having over 50 billion parameters. However …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

A better alternative to error feedback for communication-efficient distributed learning

S Horváth, P Richtárik - arXiv preprint arXiv:2006.11077, 2020 - arxiv.org
Modern large-scale machine learning applications require stochastic optimization
algorithms to be implemented on distributed compute systems. A key bottleneck of such …
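
For context, classic error feedback compensates for a biased gradient compressor by carrying the compression residual into the next step; the paper proposes an alternative to this mechanism, which the sketch below does not reproduce. A minimal single-worker sketch of plain error feedback with top-k compression; all names are illustrative.

import numpy as np

def top_k(v, k):
    # Keep the k largest-magnitude entries of v, zero out the rest.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_step(theta, grad, error, lr=0.1, k=2):
    # One worker-side step of classic error feedback:
    # compress (scaled gradient + carried error), transmit the compressed
    # vector, and keep the residual as the new error memory.
    corrected = lr * grad + error
    compressed = top_k(corrected, k)   # what actually gets communicated
    error = corrected - compressed     # residual carried to the next step
    return theta - compressed, error

theta = np.array([1.0, -2.0, 0.5, 3.0])
error = np.zeros_like(theta)
for _ in range(100):
    grad = 2 * theta                   # toy quadratic objective ||theta||^2
    theta, error = ef_step(theta, grad, error)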

On the convergence of zeroth-order federated tuning for large language models

Z Ling, D Chen, L Yao, Y Li, Y Shen - Proceedings of the 30th ACM …, 2024 - dl.acm.org
The confluence of Federated Learning (FL) and Large Language Models (LLMs) is ushering
in a new era in privacy-preserving natural language processing. However, the intensive …