Optimizing distributed load balancing for workloads with time-varying imbalance

J Lifflander, NL Slattengren, PP Pébaÿ… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
This paper explores dynamic load balancing algorithms used by asynchronous many-task
(AMT), or 'task-based', programming models to optimize task placement for scientific …

Improved Data Locality Using Morton-order Curve on the Example of LU Decomposition

M Perdacher, C Plant, C Böhm - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
The LU decomposition is an essential element used in many linear algebra applications.
Furthermore, it is used in LINPACK to benchmark the performance of modern multi-core …

Automatic translation of MPI source into a latency-tolerant, data-driven form

T Nguyen, P Cicotti, E Bylaska, D Quinlan… - Journal of Parallel and …, 2017 - Elsevier
Hiding communication behind useful computation is an important performance programming
technique but remains an inscrutable programming exercise even for the expert. We present …

Controlling concurrency and expressing synchronization in Charm++ programs

LV Kale, J Lifflander - Concurrent Objects and Beyond: Papers dedicated …, 2014 - Springer
Charm++ is a parallel programming system that evolved over the past 20 years to become a
well-established system for programming parallel science and engineering applications, in …

LU factorization: Towards hiding communication overheads with a lookahead-free algorithm

T Nguyen, SB Baden - 2015 IEEE International Conference on …, 2015 - ieeexplore.ieee.org
Lookahead is a well-known technique for masking communication in matrix factorization, but
at the cost of complicating application software. We present a new approach, based on …

Multilevel algorithms for mapping parallel MPI programs onto computing clusters

АА Пазников, МГ Курносов… - Проблемы …, 2015 - cyberleninka.ru
This paper considers the problem of mapping parallel MPI programs onto
hierarchical cluster computing systems (CS). It is required, for a given …

Implementation and analysis of block dense matrix decomposition on network-on-chips

TC Xu, T Pahikkala, A Airola, P Liljeberg… - 2012 IEEE 14th …, 2012 - ieeexplore.ieee.org
The decomposition of a dense matrix into lower and upper triangular matrices is an
important linear algebra kernel that is used in scientific and engineering applications. To …

Scalable algorithms for constructing balanced spanning trees on system-ranked process groups

A Langer, R Venkataraman, L Kale - Recent Advances in the Message …, 2012 - Springer
Current implementations of process groups (subcommunicators) have non-scalable
(O(group size)) memory footprints and even worse time complexities for setting up …

Parsimonious renewable energy management policies for smart IoT devices

KR Islam, S Tabassum, T Adhikary… - 2016 5th International …, 2016 - ieeexplore.ieee.org
Energy scarcity at homes is becoming a critical issue due to exponential growth of energy
consumption by numerous smart home appliances. Renewable energy sources help to …

Space-filling curves for improved cache-locality in shared memory environments

DIMA Perdacher - 2020 - phaidra.univie.ac.at
Today's microprocessors consist of multiple cores each of which can perform multiple
additions, multiplications, or other operations simultaneously in one clock cycle. In shared …