Affinity-based thread and data mapping in shared memory systems

M Diener, EHM Cruz, MAZ Alves, POA Navaux… - ACM Computing …, 2016 - dl.acm.org
Shared memory architectures have recently experienced a large increase in thread-level
parallelism, leading to complex memory hierarchies with multiple cache memory levels and …

Argobots: A lightweight low-level threading and tasking framework

S Seo, A Amer, P Balaji, C Bordage… - … on Parallel and …, 2017 - ieeexplore.ieee.org
In the past few decades, a number of user-level threading and tasking models have been
proposed in the literature to address the shortcomings of OS-level threads, primarily with …

memif: Towards Programming Heterogeneous Memory Asynchronously

FX Lin, X Liu - ACM SIGPLAN Notices, 2016 - dl.acm.org
To harness a heterogeneous memory hierarchy, it is advantageous to integrate application
knowledge in guiding frequent memory move, i.e., replicating or migrating virtual memory …

A tool to analyze the performance of multithreaded programs on NUMA architectures

X Liu, J Mellor-Crummey - ACM SIGPLAN Notices, 2014 - dl.acm.org
Almost all of today's microprocessors contain memory controllers and directly attach to
memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is …

Learning intermediate representations using graph neural networks for NUMA and prefetchers optimization

A TehraniJamsaz, M Popov, A Dutta… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
There is a large space of NUMA and hardware prefetcher configurations that can
significantly impact the performance of an application. Previous studies have demonstrated …

Locality-centric data and threadblock management for massive GPUs

M Khairy, V Nikiforov, D Nellans… - 2020 53rd Annual IEEE …, 2020 - ieeexplore.ieee.org
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip
will not be practical due to slowing growth in transistor density, low chip yields, and …

Modeling and optimizing NUMA effects and prefetching with machine learning

I Sánchez Barrera, D Black-Schaffer, M Casas… - Proceedings of the 34th …, 2020 - dl.acm.org
Both NUMA thread/data placement and hardware prefetcher configuration have significant
impacts on HPC performance. Optimizing both together leads to a large and complex design …

Efficient thread/page/parallelism autotuning for NUMA systems

M Popov, A Jimborean, D Black-Schaffer - Proceedings of the ACM …, 2019 - dl.acm.org
Current multi-socket systems have complex memory hierarchies with significant Non-
Uniform Memory Access (NUMA) effects: memory performance depends on the location of …

NumaMMA: NUMA memory analyzer

F Trahay, M Selva, L Morel, K Marquet - Proceedings of the 47th …, 2018 - dl.acm.org
Non Uniform Memory Access (NUMA) architectures are nowadays common for running High-
Performance Computing (HPC) applications. In such architectures, several distinct physical …

Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management

A Drebes, A Pop, K Heydemann, A Cohen… - Proceedings of the 2016 …, 2016 - dl.acm.org
Dynamic task-parallel programming models are popular on shared-memory systems,
promising enhanced scalability, load balancing and locality. Yet these promises are …