EXSCALATE: An extreme-scale virtual screening platform for drug discovery targeting polypharmacology to fight SARS-CoV-2

D Gadioli, E Vitali, F Ficarelli, C Latini… - … on Emerging Topics …, 2022 - ieeexplore.ieee.org
The social and economic impact of the COVID-19 pandemic demands a reduction of the
time required to find a therapeutic cure. In this paper, we describe the EXSCALATE …

An overview of the Legio fault resilience framework for MPI applications

R Rocco, E Boella, D Gregori, G Palermo - Procedia Computer Science, 2024 - Elsevier
While the computational power of HPC clusters has broken through the exascale milestone, it
highlighted the need for some critical features like fault management. The current de facto …

Fault awareness in the MPI 4.0 session model

R Rocco, G Palermo, D Gregori - Proceedings of the 20th ACM …, 2023 - dl.acm.org
MPI version 4.0 introduces new functionalities like the Session model but still lacks fault
management mechanisms. Past efforts produced tools and MPI standard extensions to …

EXSCALATE: An extreme-scale in-silico virtual screening platform to evaluate 1 trillion compounds in 60 hours on 81 PFLOPS supercomputers

D Gadioli, E Vitali, F Ficarelli, C Latini, C Manelfi… - arXiv preprint arXiv …, 2021 - arxiv.org
The social and economic impact of the COVID-19 pandemic demands the reduction of the
time required to find a therapeutic cure. In the context of urgent computing, we re-designed …

Exploit approximation to support fault resiliency in MPI-based applications

R Rocco, G Palermo - 2023 53rd Annual IEEE/IFIP International …, 2023 - ieeexplore.ieee.org
Approximate applications feature scalability and intrinsic fault resilience, making them
perfect for execution in the HPC scenario. The latter, in particular, is becoming more and …

Fault-aware group-collective communication creation and repair in MPI

R Rocco, G Palermo - European Conference on Parallel Processing, 2023 - Springer
The increasing size of HPC systems indicates that executions involve more nodes and
processes, making the faults' presence a more frequent eventuality. This issue becomes …

The Legio Fault Resilience Framework: Design and Rationale

R Rocco, G Palermo - Proceedings of the 20th ACM International …, 2023 - dl.acm.org
The increasing size of HPC clusters makes fault management mandatory. The current MPI
standard does not specify the behaviour after the incurrence of a fault, precluding any …

To Repair or Not to Repair: Assessing Fault Resilience in MPI Stencil Applications

R Rocco, E Boella, D Gregori, G Palermo - arXiv preprint arXiv …, 2024 - arxiv.org
With the increasing size of HPC computations, faults are becoming more and more relevant
in the HPC field. The MPI standard does not define the application behaviour after a fault …

Extending the Legio Resilience Framework to Handle Critical Process Failures in MPI

R Rocco, L Repetti, E Boella, D Gregori… - 2024 32nd Euromicro …, 2024 - ieeexplore.ieee.org
The presence of faults in distributed executions can compromise the production of results
without proper fault management techniques. The current de-facto standard for inter-process …

Algorithm-based fault-tolerant parallel sorting

ET Camargo, EPD Junior - International Journal of Critical …, 2024 - inderscienceonline.com
High performance computing (HPC) systems often require substantial resources, and can
take up to several hours or days to execute. Upon a failure, it is important to lose as little …