Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

J Duan, Z Song, X Miao, X Xi, D Lin, H Xu… - … USENIX Symposium on …, 2024 - usenix.org
Deep neural networks (DNNs) are becoming progressively large and costly to train. This
paper aims to reduce DNN training costs by leveraging preemptible instances on modern …

Unlocking unallocated cloud capacity for long, uninterruptible workloads

A Agarwal, S Noghabi, Í Goiri, S Seshan… - 20th USENIX Symposium …, 2023 - usenix.org
Cloud providers auction off unallocated resources at a low cost to avoid keeping hardware
idle. One such mechanism is Harvest VMs (HVMs). These VMs grow and shrink as the …

Making Cloud Spot Instance Interruption Events Visible

KH Kim, K Lee - Proceedings of the ACM on Web Conference 2024, 2024 - dl.acm.org
Public cloud computing providers offer a surplus of computing resources at a lower price
with a service of a spot instance. Despite the possible great cost savings from using spot …

[Book] Enhancing Molecular Dynamics Simulations with Machine Learning and Advanced Cyberinfrastructure

JCSK Kadupitige - 2022 - search.proquest.com
Molecular dynamics simulations accelerated by high-performance computing methods are
powerful tools for investigating and extracting the microscopic mechanisms characterizing …

[PDF] Leveraging spot instances for resource provisioning in serverless computing

JP Valencia Gómez - 2023 - aaltodoc.aalto.fi
Our system achieves significant cost savings: assuming a function execution time of two
minutes, our system has the same price as the Cloud Run solution at around 8,000 requests …

Fault-Tolerance in Distributed DNN Training & Inference

X Xiaoli - 2023 - search.proquest.com
Deep neural networks (DNNs) are progressively becoming larger and very costly to train.
Additionally, the fast-rising use of large models is demanding much faster inference …

A Novel Multilevel Cost Effective Fault Tolerance (CEFT) Framework Approach for High Performance Computing (HPC) Cloud

K Sharavana, JP Kumar - NeuroQuantology, 2022 - search.proquest.com
Cloud with HPC is capable of handling large applications like scientific workflow on scalable
and powerful hardware without owning or maintaining it. Although cloud computing is …