Chimera: efficiently training large-scale neural networks with bidirectional pipelines

S Li, T Hoefler - Proceedings of the International Conference for High …, 2021 - dl.acm.org
Training large deep learning models at scale is very challenging. This paper proposes
Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for …

An overview of MPI characteristics of exascale proxy applications

B Klenk, H Fröning - … Computing: 32nd International Conference, ISC High …, 2017 - Springer
The scale of applications and computing systems is tremendously increasing and needs to
increase even more to realize exascale systems. As the number of nodes keeps growing …

Distributed quantum computing with QMPI

T Häner, DS Steiger, T Hoefler, M Troyer - Proceedings of the …, 2021 - dl.acm.org
Practical applications of quantum computers require millions of physical qubits and it will be
challenging for individual quantum processors to reach such qubit numbers. It is therefore …

Hiding global communication latency in the GMRES algorithm on massively parallel machines

P Ghysels, TJ Ashby, K Meerbergen… - SIAM journal on scientific …, 2013 - SIAM
In the generalized minimal residual method (GMRES), the global all-to-all communication
required in each iteration for orthogonalization and normalization of the Krylov base vectors …

Message progression in parallel computing - to thread or not to thread?

T Hoefler, A Lumsdaine - 2008 IEEE International Conference …, 2008 - ieeexplore.ieee.org
Message progression schemes that enable communication and computation to be
overlapped have the potential to improve the performance of parallel applications. With …

Towards efficient MapReduce using MPI

T Hoefler, A Lumsdaine, J Dongarra - European Parallel Virtual Machine …, 2009 - Springer
MapReduce is an emerging programming paradigm for data-parallel applications. We
discuss common strategies to implement a MapReduce runtime and propose an optimized …

Mitigating network noise on dragonfly networks through application-aware routing

D De Sensi, S Di Girolamo, T Hoefler - Proceedings of the International …, 2019 - dl.acm.org
System noise can negatively impact the performance of HPC systems, and the
interconnection network is one of the main factors contributing to this problem. To mitigate …

Performance analysis of asynchronous Jacobi's method implemented in MPI, SHMEM and OpenMP

I Bethune, JM Bull, NJ Dingle… - … International Journal of …, 2014 - journals.sagepub.com
Ever-increasing core counts create the need to develop parallel algorithms that avoid
closely coupled execution across all cores. We present performance analysis of several …

Library Development with MPI: Attributes, Request Objects, Group Communicator Creation, Local Reductions, and Datatypes

JL Träff, I Vardas - Proceedings of the 30th European MPI Users' Group …, 2023 - dl.acm.org
A major design objective of MPI is to enable support for the construction of safe parallel
libraries that can be used and mixed freely in complex applications. In this respect, MPI has …

Distributed adaptive routing convergence to non-blocking DCN routing assignments

E Zahavi, I Keslassy, A Kolodny - IEEE Journal on Selected …, 2013 - ieeexplore.ieee.org
With the growing popularity of big-data applications, Data Center Networks (DCN)
increasingly carry larger and longer traffic flows. As a result of this increased flow granularity …