Uncertainty-aware decisions in cloud computing: Foundations and future directions

HMD Kabir, A Khosravi, SK Mondal… - ACM Computing …, 2021 - dl.acm.org
The rapid growth of the cloud industry has increased challenges in the proper governance of
the cloud infrastructure. Many intelligent systems have been developing, considering …

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Canary: fault-tolerant FaaS for stateful time-sensitive applications

M Arif, K Assogba, MM Rafique - … : International Conference for …, 2022 - ieeexplore.ieee.org
Function-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful
applications have been migrated to FaaS platforms due to their ease of deployment …

A survey of operating system support for persistent memory

M Cai, H Huang - Frontiers of Computer Science, 2021 - Springer
Emerging persistent memory technologies, like PCM and 3D XPoint, offer numerous
advantages, such as higher density, larger capacity, and better energy efficiency, compared …

Replication is more efficient than you think

A Benoit, T Herault, VL Fèvre, Y Robert - Proceedings of the International …, 2019 - dl.acm.org
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …

Node failure resiliency for Uintah without checkpointing

D Sahasrabudhe, M Berzins… - … : Practice and Experience, 2019 - Wiley Online Library
The frequency of failures in upcoming exascale supercomputers may well be greater than at
present due to many-core architectures if component failure rates remain unchanged. This …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Joint availability enhancement and traffic optimization of virtual cluster allocation in cloud datacenters

X Liu, B Cheng, S Wang, J Chen - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
As more and more services are deployed in the cloud datacenter, network traffic is growing
exponentially. Virtual machines (VMs) of a virtual cluster (VC) must be allocated on physical …

Enhancing Asynchronous Many-Task Runtime Systems for Next-Generation Architectures and Exascale Supercomputers

D Sahasrabudhe - 2021 - search.proquest.com
Exascale supercomputers capable of computing 10^18 double-precision floating point
operations per second are expected to be operational around 2022/23. The complexity and …

Optimal placement of in-memory checkpoints under heterogeneous failure likelihoods

Z Hussain, T Znati, R Melhem - 2019 IEEE International Parallel …, 2019 - ieeexplore.ieee.org
In-memory checkpointing has increased in popularity over the years because it significantly
improves the time to take a checkpoint. It is usually accomplished by placing all or part of a …