Polly—performing polyhedral optimizations on a low-level intermediate representation

T Grosser, A Groesslinger, C Lengauer - Parallel Processing Letters, 2012 - World Scientific
The polyhedral model for loop parallelization has proved to be an effective tool for advanced
optimization and automatic parallelization of programs in higher-level languages. Yet, to …
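The central object here is a static control part: a loop nest whose bounds and array subscripts are affine functions of the loop counters, which lets the compiler treat every iteration as an integer point in a polyhedron. A minimal C sketch of such a region (function and array names are illustrative, not taken from the paper):

    /* Affine loop nest: bounds and subscripts are affine in i and j, so a
     * polyhedral optimizer such as Polly can legally tile, interchange, or
     * parallelize the iterations after dependence analysis. */
    #define N 1024

    void smooth(float A[N][N], float B[N][N])
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                B[i][j] = 0.25f * (A[i - 1][j] + A[i + 1][j]
                                 + A[i][j - 1] + A[i][j + 1]);
    }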

Polyhedral parallel code generation for CUDA

S Verdoolaege, J Carlos Juega, A Cohen… - ACM Transactions on …, 2013 - dl.acm.org
This article addresses the compilation of a sequential program for parallel execution on a
modern GPU. To this end, we present a novel source-to-source compiler called PPCG …

Polly-ACC: transparent compilation to heterogeneous hardware

T Grosser, T Hoefler - Proceedings of the 2016 International Conference …, 2016 - dl.acm.org
Programming today's increasingly complex heterogeneous hardware is difficult, as it
commonly requires the use of data-parallel languages, pragma annotations, specialized …

Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons

C Nugteren, H Corporaal - Proceedings of the 5th Annual Workshop on …, 2012 - dl.acm.org
Recent advances in multi-core and many-core processors require programmers to exploit
an increasing amount of parallelism from their applications. Data parallel languages such as …

Bones: An automatic skeleton-based C-to-CUDA compiler for GPUs

C Nugteren, H Corporaal - ACM Transactions on Architecture and Code …, 2014 - dl.acm.org
The shift toward parallel processor architectures has made programming and code
generation increasingly challenging. To address this programmability challenge, this article …
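The skeleton approach rests on classifying loops into algorithmic classes and instantiating a matching parallel skeleton (for example, a CUDA kernel) for each class. A hedged C sketch of the kind of loop such a classifier treats as an element-wise map, assuming independent iterations (illustrative only; this is not the Bones annotation syntax):

    /* Element-wise "map": every iteration reads in[i] and writes out[i]
     * independently, so a skeleton-based compiler can lower the loop to a
     * data-parallel GPU skeleton. Illustrative example, not from the paper. */
    void scale(const float *in, float *out, int n, float alpha)
    {
        for (int i = 0; i < n; i++)
            out[i] = alpha * in[i];
    }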

TC-CIM: Empowering tensor comprehensions for computing-in-memory

A Drebes, L Chelini, O Zinenko, A Cohen… - 10th International …, 2020 - research.tue.nl
Memristor-based, non-von Neumann architectures performing tensor operations
directly in memory are a promising approach to address the ever-increasing demand for …

Automatic parallelization of tiled loop nests with enhanced fine-grained parallelism on GPUs

P Di, D Ye, Y Su, Y Sui, J Xue - 2012 41st International …, 2012 - ieeexplore.ieee.org
Automatically parallelizing loop nests into CUDA kernels must exploit the full potential of
GPUs to obtain high performance. One state-of-the-art approach makes use of the …
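Tiling is the transformation at issue: the iteration space is partitioned into fixed-size blocks, and on a GPU each tile typically maps to a thread block while the intra-tile iterations map to threads. A small C sketch of a tiled loop nest (tile size and array names are assumptions, not values from the paper):

    /* Loop tiling: the i/j iteration space is split into TILE x TILE blocks.
     * On a GPU, each tile can be mapped to a thread block and the intra-tile
     * loops to threads. Illustrative sketch, not code from the paper. */
    #define N    4096
    #define TILE 32

    void tiled_copy(float A[N][N], float B[N][N])
    {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        B[i][j] = A[i][j];
    }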

Automatic CPU/GPU generation of multi-versioned OpenCL kernels for C++ scientific applications

R Sotomayor, LM Sanchez, J Garcia Blas… - International Journal of …, 2017 - Springer
Parallelism has become one of the most extended paradigms used to improve performance.
However, it forces software developers to adapt applications and coding mechanisms to …

An interactive tool based on Polly for detection and parallelization of loops

D Göhringer, J Tepelmann - … of Workshop on Parallel Programming and …, 2014 - dl.acm.org
In many applications, such as signal and image processing, most computation time is spent
within loops. Therefore, these loops are ideal candidates for performance increase when …

Transitioning spiking neural network simulators to heterogeneous hardware

QAP Nguyen, P Andelfinger, WJ Tan, W Cai… - ACM Transactions on …, 2021 - dl.acm.org
Spiking neural networks (SNN) are among the most computationally intensive types of
simulation models, with node counts on the order of up to 10^11. Currently, there is intensive …