Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Systematically inferring I/O performance variability by examining repetitive job behavior

E Costa, T Patel, B Schwaller, JM Brandt… - Proceedings of the …, 2021 - dl.acm.org
Monitoring and analyzing I/O behaviors is critical to the efficient utilization of parallel storage
systems. Unfortunately, with increasing I/O requirements and resource contention, I/O …

Fine-grained scheduling for containerized HPC workloads in Kubernetes clusters

P Liu, J Guitart - 2022 IEEE 24th Int Conf on High Performance …, 2022 - ieeexplore.ieee.org
Containerization technology offers lightweight OS-level virtualization, and enables
portability, reproducibility, and flexibility by packing applications with low performance …

Autonomous task dropping mechanism to achieve robustness in heterogeneous computing systems

A Mokhtari, C Denninnart… - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
Robustness of a distributed computing system is defined as the ability to maintain its
performance in the presence of uncertain parameters. Uncertainty is a key problem in …

A multi-GPU parallel genetic algorithm for large-scale vehicle routing problems

M Abdelatti, M Sodhi, R Sendag - 2022 IEEE High Performance …, 2022 - ieeexplore.ieee.org
The Vehicle Routing Problem (VRP) is fundamental to logistics operations. Finding optimal
solutions for VRPs related to large, real-world operations is computationally expensive …

Speculative scheduling for stochastic HPC applications

A Gainaru, GP Aupy, H Sun, P Raghavan - Proceedings of the 48th …, 2019 - dl.acm.org
New emerging fields are developing a growing number of large-scale applications with
heterogeneous, dynamic and data-intensive requirements that put a high emphasis on …

Convergence of high performance computing, big data, and machine learning applications on containerized infrastructures

P Liu - 2023 - upcommons.upc.edu
The convergence of High Performance Computing (HPC), Big Data (BD), and
Machine Learning (ML) in the computing continuum is being pursued in earnest across the …

On-the-fly scheduling versus reservation-based scheduling for unpredictable workflows

A Gainaru, H Sun, G Aupy, Y Huo… - … Journal of High …, 2019 - journals.sagepub.com
Scientific insights in the coming decade will clearly depend on the effective processing of
large data sets generated by dynamic heterogeneous applications typical of workflows in …

Profiles of upcoming HPC Applications and their Impact on Reservation Strategies

A Gainaru, B Goglin, V Honoré… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
With the expected convergence between HPC, BigData and AI, new applications with
different profiles are coming to HPC infrastructures. We aim at better understanding the …

Lessons From Examining Repetitive Job Behavior and I/O Performance Variability on a Production HPC System

E Costa, T Patel, B Schwaller, J Brandt, D Tiwari - 2021 - osti.gov
As I/O demand of scientific applications increases, identifying, predicting, and analyzing I/O
behaviors is critical to ensure parallel storage systems are efficiently utilized. This paper …