Argobots: A lightweight low-level threading and tasking framework

S Seo, A Amer, P Balaji, C Bordage… - … on Parallel and …, 2017 - ieeexplore.ieee.org
In the past few decades, a number of user-level threading and tasking models have been
proposed in the literature to address the shortcomings of OS-level threads, primarily with …

CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications

Y Yang, H Zhou - ACM SIGPLAN Notices, 2014 - dl.acm.org
Parallel programs consist of series of code sections with different thread-level parallelism
(TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU …

BOLT: Optimizing OpenMP parallel regions with user-level threads

S Iwasaki, A Amer, K Taura, S Seo… - 2019 28th International …, 2019 - ieeexplore.ieee.org
OpenMP is widely used by a number of applications, computational libraries, and runtime
systems. As a result, multiple levels of the software stack use OpenMP independently of one …

[BOOK][B] Heterogeneous computing architectures: Challenges and vision

O Terzo, K Djemame, A Scionti, C Pezuela - 2019 - books.google.com
Heterogeneous Computing Architectures: Challenges and Vision provides an updated
vision of the state-of-the-art of heterogeneous computing systems, covering all the aspects …

Real-time scheduling and analysis of OpenMP DAG tasks supporting nested parallelism

J Sun, N Guan, F Li, H Gao, C Shi… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
OpenMP is a promising framework to develop parallel real-time software on multi-cores.
Although similar to the DAG task model, OpenMP task systems are significantly more difficult …

A microbenchmark study of OpenMP overheads under nested parallelism

VV Dimakopoulos, PE Hadjidoukas… - OpenMP in a New Era of …, 2008 - Springer
In this work we present a microbenchmark methodology for assessing the overheads
associated with nested parallelism in OpenMP. Our techniques are based on extensions to …

Data access history cache and associated data prefetching mechanisms

Y Chen, S Byna, XH Sun - Proceedings of the 2007 ACM/IEEE …, 2007 - dl.acm.org
Data prefetching is an effective way to bridge the increasing performance gap between
processor and memory. As computing power is increasing much faster than memory …

Scheduling dynamic OpenMP applications over multicore architectures

F Broquedis, F Diakhaté, S Thibault, O Aumage… - OpenMP in a New Era of …, 2008 - Springer
Approaching the theoretical performance of hierarchical multicore machines requires a very
careful distribution of threads and data among the underlying non-uniform architecture in …

Fast and lightweight support for nested parallelism on cluster-based embedded many-cores

A Marongiu, P Burgio, L Benini - 2012 Design, Automation & …, 2012 - ieeexplore.ieee.org
Several recent many-core accelerators have been architected as fabrics of tightly-coupled
shared memory clusters. A hierarchical interconnection system is used, with a crossbar-like …

GLTO: On the adequacy of lightweight thread approaches for OpenMP implementations

A Castelló, S Seo, R Mayo, P Balaji… - 2017 46th …, 2017 - ieeexplore.ieee.org
OpenMP is the de facto standard application programming interface (API) for on-node
parallelism. The most popular OpenMP runtimes rely on POSIX threads (pthreads) …