S Li, Y Zhao, R Varma, O Salpekar, P Noordhuis… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in …
W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has flourished in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …
The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …
Containers are widely used for resource management in datacenters. A common practice to support deep learning (DL) training in container clouds is to statically bind GPUs to …
Web applications rely heavily on software caches to achieve low-latency, high-throughput services. To adapt to changing workloads, three types of learned caches (learned evictions) …
M Yu, T Cao, W Wang, R Chen - 20th USENIX Symposium on …, 2023 - usenix.org
Serverless applications are typically composed of function workflows in which multiple short- lived functions are triggered to exchange data in response to events or state changes …
Stragglers, Byzantine workers, and data privacy are the main bottlenecks in distributed cloud computing. Some prior works proposed coded computing strategies to jointly address all …
Prediction serving systems are designed to provide large volumes of low-latency inferences from machine learning models. These systems mix data processing and computationally …
Large-scale computations are ubiquitous and demand exorbitant resources, with matrix multiplication being a prominent example. Multiplying high-dimensional matrices is …