Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Run-to-run variability on Xeon Phi based Cray XC systems

S Chunduri, K Harms, S Parker, V Morozov… - Proceedings of the …, 2017 - dl.acm.org
The increasing complexity of HPC systems has introduced new sources of variability, which
can contribute to significant differences in run-to-run performance of applications. With …

Evaluation of an interference-free node allocation policy on fat-tree clusters

SD Pollard, N Jain, S Herbein… - … Conference for High …, 2018 - ieeexplore.ieee.org
Interference between jobs competing for network bandwidth on a fat-tree cluster can cause
significant variability and degradation in performance. These performance issues can be …

HyperX topology: First at-scale implementation and comparison to the fat-tree

J Domke, S Matsuoka, IR Ivanov, Y Tsushima… - Proceedings of the …, 2019 - dl.acm.org
The de-facto standard topology for modern HPC systems and data-centers are Folded Clos
networks, commonly known as Fat-Trees. The number of network endpoints in these …

Performance optimality or reproducibility: that is the question

T Patki, JJ Thiagarajan, A Ayala, TZ Islam - Proceedings of the …, 2019 - dl.acm.org
The era of extremely heterogeneous supercomputing brings with itself the devil of increased
performance variation and reduced reproducibility. There is a lack of understanding in the …

Monitoring large scale supercomputers: A case study with the Lassen supercomputer

T Patki, A Bertsch, I Karlin, DH Ahn… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
Scalable management of user workloads on large-scale supercomputers remains a
challenge due to the tradeoff between capturing adequate detail for analysis from various …

Edge-Disjoint Tree Allocation for Multi-Tenant Cloud Security in Datacenter Topologies

O Rottenstreich, J Yallouz - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Resource sharing with its implied mutual interference has been considered a major concern
for running applications of multiple tenants in shared cloud datacenters. Besides its security …

Analyzing cost-performance tradeoffs of HPC network designs under different constraints using simulations

A Bhatele, N Jain, M Mubarak, T Gamblin - Proceedings of the 2019 …, 2019 - dl.acm.org
Identifying a suitable network topology and deciding its optimal configuration parameters are
critical aspects of the overall HPC system design, procurement and installation process …

Throttling network bandwidth using per-node network interfaces

N Moldvai, M Malpani - US Patent 10,819,656, 2020 - Google Patents
Methods and systems for throttling per-node network bandwidths over time to maximize the
aggregate bandwidth of a distributed cluster of nodes without exceeding a global bandwidth …

Chunk allocation

G Juniwal, G Jain, A Gee - US Patent 11,030,062, 2021 - Google Patents
Methods and systems for identifying a set of disks within a cluster and then storing a plurality
of data chunks into the set of disks such that the placement of the plurality of data chunks …