Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service

B Li, S Samsi, V Gadepally, D Tiwari - Proceedings of the International …, 2023 - dl.acm.org
This paper presents a solution to the challenge of mitigating carbon emissions from hosting
large-scale machine learning (ML) inference services. ML inference is critical to modern …
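
The trade-off named in the title, carbon-aware inference, can be illustrated with a small sketch. The model variants, energy figures, and trade-off weight below are invented placeholders, not Clover's actual mechanism: as grid carbon intensity rises, the selector shifts toward smaller, lower-energy model variants.

```python
from dataclasses import dataclass

@dataclass
class ModelVariant:
    name: str
    accuracy: float            # validation accuracy (illustrative)
    energy_kwh_per_1k: float   # energy per 1,000 requests (illustrative)

VARIANTS = [
    ModelVariant("resnet152", 0.78, 0.40),
    ModelVariant("resnet50",  0.76, 0.15),
    ModelVariant("resnet18",  0.70, 0.05),
]

def pick_variant(carbon_g_per_kwh: float, tradeoff: float = 0.0005) -> ModelVariant:
    """Score each variant by accuracy minus a penalty proportional to its
    estimated emissions at the current grid carbon intensity."""
    def score(v: ModelVariant) -> float:
        grams_per_1k = v.energy_kwh_per_1k * carbon_g_per_kwh
        return v.accuracy - tradeoff * grams_per_1k
    return max(VARIANTS, key=score)

print(pick_variant(100).name)   # clean grid -> resnet152
print(pick_variant(600).name)   # dirty grid -> resnet50
```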

Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources

B Li, S Samsi, V Gadepally, D Tiwari - Proceedings of the 32nd …, 2023 - dl.acm.org
Online inference is becoming a key service product for many businesses, deployed in cloud
platforms to meet customer demands. Despite their revenue-generation capability, these …
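
As a rough sketch of the cost-efficiency problem Kairos targets (not its actual algorithm), one can brute-force over small mixes of heterogeneous instance types and keep the cheapest mix that meets a target throughput. Instance names and throughput/price figures below are made up.

```python
import math
from itertools import product

INSTANCE_TYPES = {            # name: (queries/sec, $/hour) -- invented figures
    "g5.xlarge":   (410.0, 1.006),
    "g4dn.xlarge": (220.0, 0.526),
    "c6i.2xlarge": ( 90.0, 0.340),
}

def provision(target_qps: float) -> tuple[dict, float]:
    """Exhaustively search small instance mixes; return the cheapest mix
    whose aggregate throughput covers target_qps."""
    names = list(INSTANCE_TYPES)
    caps = [math.ceil(target_qps / INSTANCE_TYPES[n][0]) for n in names]
    best_mix, best_cost = {}, float("inf")
    for counts in product(*(range(c + 1) for c in caps)):
        qps  = sum(k * INSTANCE_TYPES[n][0] for n, k in zip(names, counts))
        cost = sum(k * INSTANCE_TYPES[n][1] for n, k in zip(names, counts))
        if qps >= target_qps and cost < best_cost:
            best_mix, best_cost = dict(zip(names, counts)), cost
    return best_mix, best_cost

print(provision(500))   # e.g. one g5.xlarge plus one c6i.2xlarge covers 500 QPS
```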

TCB: Accelerating Transformer Inference Services with Request Concatenation

B Fu, F Chen, P Li, D Zeng - … of the 51st International Conference on …, 2022 - dl.acm.org
The Transformer has dominated the field of natural language processing because of its strong
capability to learn from sequential input data. In recent years, various computing and …
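
The mechanism named in the title, request concatenation, amounts to merging several queued requests into one padded batch so the transformer runs once rather than once per request. The sketch below shows only that generic padding step; the pad id and tensor layout are assumptions, not TCB's interfaces.

```python
import torch

PAD_ID = 0   # assumed padding token id

def concat_requests(requests: list[list[int]]) -> tuple[torch.Tensor, torch.Tensor]:
    """Pad variable-length token-id lists to a common length and return
    (input_ids, attention_mask) for a single forward pass."""
    max_len = max(len(r) for r in requests)
    ids  = torch.full((len(requests), max_len), PAD_ID, dtype=torch.long)
    mask = torch.zeros(len(requests), max_len, dtype=torch.long)
    for i, r in enumerate(requests):
        ids[i, :len(r)]  = torch.tensor(r)
        mask[i, :len(r)] = 1
    return ids, mask

ids, mask = concat_requests([[101, 7592, 102], [101, 2088, 999, 102]])
print(ids.shape)   # torch.Size([2, 4]) -- one batched pass instead of two
```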

Characterizing Multi-Instance GPU for Machine Learning Workloads

B Li, V Gadepally, S Samsi… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
As machine learning (ML) becomes increasingly popular, datacenter operators use
hardware accelerators such as GPUs to tackle the high computation demand of ML …
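
Characterization studies like this one boil down to running a fixed workload on each hardware configuration and comparing. A minimal sketch, assuming PyTorch and an NVIDIA GPU (not the paper's harness): time dense matmuls on whatever device CUDA_VISIBLE_DEVICES exposes, then repeat per MIG slice.

```python
import time
import torch

def matmul_tflops(n: int = 4096, iters: int = 50) -> float:
    """Achieved TFLOP/s for n x n matrix multiplies on the visible device."""
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    torch.cuda.synchronize()                 # exclude setup from timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()                 # wait for all queued kernels
    elapsed = time.perf_counter() - start
    return (2 * n ** 3 * iters) / elapsed / 1e12

if __name__ == "__main__":
    # Run once per configuration, e.g. CUDA_VISIBLE_DEVICES=MIG-<uuid> python bench.py
    print(f"{matmul_tflops():.2f} TFLOP/s on {torch.cuda.get_device_name(0)}")
```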

SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training

A Ali, S Zawad, P Aditya, IE Akkus, R Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
In today's production machine learning (ML) systems, models are continuously trained,
improved, and deployed. ML design and training are becoming a continuous workflow of …
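
Serverless training frameworks must cope with stateless, time-bounded function invocations. The sketch below shows only that generic pattern (checkpoint in, bounded work, checkpoint out); the handler shape and local-file checkpointing are illustrative stand-ins, not SMLT's API.

```python
import os
import torch
import torch.nn as nn

CKPT = "model.ckpt"   # a real deployment would use object storage, not a local file

def train_handler(num_batches: int = 10) -> None:
    """One serverless invocation: restore state, do bounded work, persist state."""
    model = nn.Linear(16, 1)
    if os.path.exists(CKPT):
        model.load_state_dict(torch.load(CKPT))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(num_batches):             # bounded so we finish before timeout
        x = torch.randn(32, 16)
        loss = ((model(x) - x.sum(dim=1, keepdim=True)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    torch.save(model.state_dict(), CKPT)     # persist before the function exits

train_handler()   # each call resumes where the previous invocation stopped
```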

Graph3PO: A Temporal Graph Data Processing Method for Latency QoS Guarantee in Object Cloud Storage System

W Zhang, Z Shi, Z Liao, Y Li, Y Du, Y Wu… - Proceedings of the …, 2023 - dl.acm.org
Object cloud storage systems are deployed with diverse applications that have varying
latency service level objectives (SLOs), posing challenges for supporting quality of service …

Cost-Efficient Serverless Inference Serving with Joint Batching and Multi-Processing

S Cai, Z Zhou, K Zhao, X Chen - Proceedings of the 14th ACM SIGOPS …, 2023 - dl.acm.org
With the emergence of machine learning, many commercial companies increasingly use
machine learning inference systems as backend services to improve their products …
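
The title names two levers, batching and multi-processing; a toy sketch of how they compose (group requests into batches, serve them from several worker processes) follows. The doubling "model" and all sizes are placeholders, not the paper's system.

```python
import multiprocessing as mp

def worker(batch_queue):
    while True:
        batch = batch_queue.get()
        if batch is None:                       # sentinel: shut down
            return
        results = [x * 2 for x in batch]        # stand-in for model inference
        print(f"{mp.current_process().name} served {len(results)} requests")

def serve(requests, batch_size=4, num_workers=2):
    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(q,)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    for i in range(0, len(requests), batch_size):   # batching
        q.put(requests[i:i + batch_size])
    for _ in procs:                                 # one sentinel per worker
        q.put(None)
    for p in procs:
        p.join()

if __name__ == "__main__":
    serve(list(range(10)))
```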

Dash: Scheduling Deep Learning Workloads on Multi-Generational GPU-Accelerated Clusters

B Li, T Patel, V Gadepally, K Gettings… - 2022 IEEE High …, 2022 - ieeexplore.ieee.org
Two notable characteristics of modern GPU-accelerated HPC clusters are: (1) they
increasingly run deep learning (DL) model-training workloads, and (2) they consist of …
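
One intuition behind multi-generational scheduling is that jobs speed up unevenly across GPU generations, so the newest GPUs should go to the jobs that gain the most from them. The sketch below encodes only that heuristic; the speedup table is invented and this is not Dash's actual policy.

```python
SPEEDUP = {                 # job -> speedup of A100 over V100 (invented numbers)
    "bert-large": 2.9,
    "resnet50":   1.6,
    "pointnet":   1.2,
}

def assign(jobs: list[str], a100_slots: int) -> dict[str, str]:
    """Give the limited A100 slots to the jobs with the largest speedup."""
    ranked = sorted(jobs, key=lambda j: SPEEDUP[j], reverse=True)
    return {j: ("A100" if i < a100_slots else "V100") for i, j in enumerate(ranked)}

print(assign(["resnet50", "bert-large", "pointnet"], a100_slots=1))
# -> {'bert-large': 'A100', 'resnet50': 'V100', 'pointnet': 'V100'}
```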

ESG: Pipeline-Conscious Efficient Scheduling of DNN Workflows on Serverless Platforms with Shareable GPUs

X Hui, Y Xu, Z Guo, X Shen - arXiv preprint arXiv:2404.16812, 2024 - arxiv.org
Recent years have witnessed increasing interest in machine learning inference on
serverless computing for its auto-scaling and cost-effectiveness properties. Existing serverless …
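
The GPU-sharing aspect named in the title can be illustrated with a toy packing pass (not ESG's scheduler): workflow stages request fractional GPU shares, and a first-fit pass co-locates stages on shareable GPUs so partial capacity is not wasted. Stage names and demands are invented.

```python
STAGES = [("wfA/preproc", 0.2), ("wfA/model", 0.6),
          ("wfB/model", 0.5), ("wfB/post", 0.3)]

def pack(stages, num_gpus=2):
    """First-fit packing of fractional GPU demands onto shareable GPUs."""
    free = [1.0] * num_gpus            # remaining fraction on each GPU
    placement = {}
    for name, demand in stages:
        for gpu, cap in enumerate(free):
            if demand <= cap:          # first GPU with enough spare capacity
                free[gpu] -= demand
                placement[name] = f"gpu{gpu}"
                break
        else:
            placement[name] = "queued" # no GPU has room right now
    return placement

print(pack(STAGES))
# -> {'wfA/preproc': 'gpu0', 'wfA/model': 'gpu0', 'wfB/model': 'gpu1', 'wfB/post': 'gpu1'}
```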

mSIRM: Cost-Efficient and SLO-aware ML Load Balancing on Fog and Multi-Cloud Network

C Phalak, D Chahal, M Ramesh… - … of the 13th Workshop on AI …, 2023 - dl.acm.org
The use of intelligent sensors and edge devices for industrial automation has grown
exponentially, driven by the need to hyper-personalize applications, minimize cost, improve efficiency, and …
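
A minimal sketch of SLO-aware placement across fog and cloud tiers, assuming invented latency and cost figures (this is not mSIRM's algorithm): route each request to the cheapest tier whose expected latency still meets the SLO.

```python
TIERS = [                    # (name, expected latency in ms, relative cost)
    ("fog-edge",   15.0, 3.0),
    ("cloud-east", 60.0, 1.0),
    ("cloud-west", 95.0, 0.8),
]

def route(slo_ms: float) -> str:
    """Pick the cheapest tier that meets the latency SLO; fall back to the
    fastest tier if none does."""
    feasible = [(name, cost) for name, lat, cost in TIERS if lat <= slo_ms]
    if not feasible:
        return min(TIERS, key=lambda t: t[1])[0]
    return min(feasible, key=lambda t: t[1])[0]

print(route(slo_ms=100))  # -> cloud-west (cheapest tier under 100 ms)
print(route(slo_ms=30))   # -> fog-edge   (only the fog tier is fast enough)
```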