PetaBricks: A language and compiler for algorithmic choice

J Ansel, C Chan, YL Wong, M Olszewski, Q Zhao… - ACM Sigplan …, 2009 - dl.acm.org
It is often impossible to obtain a one-size-fits-all solution for high performance algorithms
when considering different choices for data distributions, parallelism, transformations, and …

Language and compiler support for auto-tuning variable-accuracy algorithms

J Ansel, YL Wong, C Chan, M Olszewski… - … Symposium on Code …, 2011 - ieeexplore.ieee.org
Approximating ideal program outputs is a common technique for solving computationally
difficult problems, for adhering to processing or timing constraints, and for performance …

Discrete Fourier transform on multicore

F Franchetti, M Püschel, Y Voronenko… - IEEE Signal …, 2009 - ieeexplore.ieee.org
This article gives an overview on the techniques needed to implement the discrete Fourier
transform (DFT) efficiently on current multicore systems. The focus is on Intel-compatible …

Topologically adaptive parallel breadth-first search on multicore processors

Y Xia, VK Prasanna - Proc. 21st Int'l. Conf. on Parallel and Distributed …, 2009 - Citeseer
Breadth-first Search (BFS) is a fundamental graph theory algorithm that is extensively used
to abstract various challenging computational problems. Due to the fine-grained irregular …

xMath2.0: A high-performance extended math library for SW26010-Pro many-core processor

F Liu, W Ma, Y Zhao, D Chen, Y Hu, Q Lu… - CCF Transactions on …, 2023 - Springer
High performance extended math library is used by many scientific engineering and artificial
intelligence applications, which usually involves many common mathematical computations …

MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework

Y Zhao, F Liu, W Ma, H Li, Y Peng… - ACM Transactions on …, 2023 - dl.acm.org
Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel
programs, and data communication is the main performance bottleneck of FFT and seriously …

Unconventional parallelization of nondeterministic applications

EA Deiana, V St-Amour, PA Dinda… - Proceedings of the …, 2018 - dl.acm.org
The demand for thread-level-parallelism (TLP) on commodity processors is endless as it is
essential for gaining performance and saving energy. However, TLP in today's programs is …

Using hybrid parallelism to improve memory use in the Uintah framework

Q Meng, M Berzins, J Schmidt - Proceedings of the 2011 TeraGrid …, 2011 - dl.acm.org
The Uintah Software framework was developed to provide an environment for solving fluid-
structure interaction problems on structured adaptive grids on large-scale, long-running …

FAST: A fast stencil autotuning framework based on an optimal-solution space model

Y Luo, G Tan, Z Mo, N Sun - Proceedings of the 29th ACM on …, 2015 - dl.acm.org
Stencil computations comprise an important class of kernels in many scientific computing
applications. As the diversity of both architectures and programming models grow …

Implementation and evaluation of a microthread architecture

K Bousias, L Guang, CR Jesshope… - Journal of Systems …, 2009 - Elsevier
Future many-core processor systems require scalable solutions that conventional
architectures currently do not provide. This paper presents a novel architecture that …