ForestGOMP: an efficient OpenMP environment for NUMA architectures

M Diener, EHM Cruz, MAZ Alves, POA Navaux… - ACM Computing …, 2016 - dl.acm.org

Shared memory architectures have recently experienced a large increase in thread-level
parallelism, leading to complex memory hierarchies with multiple cache memory levels and …

Cited by 54 · Related articles · All 6 versions

[PDF] ieee.org

Argobots: A lightweight low-level threading and tasking framework

S Seo, A Amer, P Balaji, C Bordage… - … on Parallel and …, 2017 - ieeexplore.ieee.org

In the past few decades, a number of user-level threading and tasking models have been
proposed in the literature to address the shortcomings of OS-level threads, primarily with …

Cited by 156 · Related articles · All 17 versions

[PDF] github.io

memif: Towards Programming Heterogeneous Memory Asynchronously

FX Lin, X Liu - ACM SIGPLAN Notices, 2016 - dl.acm.org

To harness a heterogeneous memory hierarchy, it is advantageous to integrate application
knowledge in guiding frequent memory move, i.e., replicating or migrating virtual memory …

Cited by 77 · Related articles · All 4 versions

A tool to analyze the performance of multithreaded programs on NUMA architectures

X Liu, J Mellor-Crummey - ACM SIGPLAN Notices, 2014 - dl.acm.org

Almost all of today's microprocessors contain memory controllers and directly attach to
memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is …

Cited by 90 · Related articles · All 3 versions

[PDF] arxiv.org

Learning intermediate representations using graph neural networks for NUMA and prefetchers optimization

A TehraniJamsaz, M Popov, A Dutta… - 2022 IEEE …, 2022 - ieeexplore.ieee.org

There is a large space of NUMA and hardware prefetcher configurations that can
significantly impact the performance of an application. Previous studies have demonstrated …

Cited by 16 · Related articles · All 8 versions

[PDF] nsf.gov

Locality-centric data and threadblock management for massive GPUs

M Khairy, V Nikiforov, D Nellans… - 2020 53rd Annual IEEE …, 2020 - ieeexplore.ieee.org

Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip
will not be practical due to slowing growth in transistor density, low chip yields, and …

Cited by 30 · Related articles · All 8 versions

[PDF] acm.org

Modeling and optimizing NUMA effects and prefetching with machine learning

I Sánchez Barrera, D Black-Schaffer, M Casas… - Proceedings of the 34th …, 2020 - dl.acm.org

Both NUMA thread/data placement and hardware prefetcher configuration have significant
impacts on HPC performance. Optimizing both together leads to a large and complex design …

Cited by 35 · Related articles · All 3 versions

[PDF] acm.org

Efficient thread/page/parallelism autotuning for NUMA systems

M Popov, A Jimborean, D Black-Schaffer - Proceedings of the ACM …, 2019 - dl.acm.org

Current multi-socket systems have complex memory hierarchies with significant Non-
Uniform Memory Access (NUMA) effects: memory performance depends on the location of …

Cited by 38 · Related articles · All 4 versions

[PDF] hal.science

NumaMMA: NUMA memory analyzer

F Trahay, M Selva, L Morel, K Marquet - Proceedings of the 47th …, 2018 - dl.acm.org

Non Uniform Memory Access (NUMA) architectures are nowadays common for running High-
Performance Computing (HPC) applications. In such architectures, several distinct physical …

Cited by 34 · Related articles · All 6 versions

[PDF] hal.science

Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management

A Drebes, A Pop, K Heydemann, A Cohen… - Proceedings of the 2016 …, 2016 - dl.acm.org

Dynamic task-parallel programming models are popular on shared-memory systems,
promising enhanced scalability, load balancing and locality. Yet these promises are …

Cited by 44 · Related articles · All 9 versions
