Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs

Q Hu, M Zhang, P Sun, Y Wen, T Zhang - Proceedings of the 28th ACM …, 2023 - dl.acm.org
While recent deep learning workload schedulers exhibit excellent performance, it is arduous
to deploy them in practice due to some substantial defects, including inflexible intrusive …

Cilantro: Performance-Aware resource allocation for general objectives via online feedback

R Bhardwaj, K Kandasamy, A Biswal, W Guo… - … USENIX Symposium on …, 2023 - usenix.org
Traditional systems for allocating finite cluster resources among competing jobs have either
aimed at providing fairness, relied on users to specify their resource requirements, or have …

Codec: Cost-effective duration prediction system for deadline scheduling in the cloud

H Li, M Ma, Y Liu, S Qin, B Qiao, R Yao… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org
Modern cloud platforms allow customers to flexibly allocate or release computing resources.
One crucial scenario is how to drive existing VMs to a specific state by a given deadline in a …

Smartpick: Workload Prediction for Serverless-enabled Scalable Data Analytics Systems

AD Mohapatra, K Oh - Proceedings of the 24th International Middleware …, 2023 - dl.acm.org
Many data analytic systems have adopted a newly emerging compute resource, serverless
(SL), to handle data analytics queries in a timely and cost-efficient manner, i.e., serverless …

Energy-aware scheduling for spark job based on deep reinforcement learning in cloud

H Li, L Lu, W Shi, G Tan, H Luo - Computing, 2023 - Springer
Big data frameworks such as Storm, Spark and Hadoop are widely deployed in commercial
and research applications, the energy consumption of cloud data centers that support big …

Cougar: A General Framework for Jobs Optimization In Cloud

B Sang, S Gu, X Zhan, M Tang, J Liu… - 2023 IEEE 39th …, 2023 - ieeexplore.ieee.org
In the cloud environment, different kinds of jobs (Flink, PyTorch, TensorFlow, AI-Serving) are
running in the same cluster with different service-level agreements (SLA). To manage large …

PolarisProfiler: A Novel Metadata-Based Profiling Approach for Optimizing Resource Management in the Edge-Cloud Continuum

A Morichetta, V Casamayor-Pujol, S Nastic, S Dustdar… - SOSE, 2023 - dsg.tuwien.ac.at
Resource provisioning is vital in large-scale, geographically distributed, and hierarchically
organized infrastructures, and, at the same time, it represents one of the stiffest challenges in …

Cost-Intelligent Data Analytics in the Cloud

H Zhang, Y Liu, J Yan - arXiv preprint arXiv:2308.09569, 2023 - arxiv.org
For decades, database research has focused on optimizing performance under fixed
resources. As more and more database applications move to the public cloud, we argue that …

PB3Opt: Profile‐based biased Bayesian optimization to select computing clusters on the cloud

T Aparecida Silva Camacho… - Concurrency and …, 2023 - Wiley Online Library
Given the wide variety of cloud computing resources for creating high‐performance
computer clusters and their complex performance relationship with applications, finding the …

Performance models of data parallel DAG workflows for large scale data analytics

J Shi, J Lu - Distributed and Parallel Databases, 2023 - Springer
Directed Acyclic Graph (DAG) workflows are widely used for large-scale data
analytics in cluster-based distributed computing systems. The performance model for a DAG …