Dynamic warp subdivision for integrated branch and memory divergence tolerance

J Meng, D Tarjan, K Skadron - Proceedings of the 37th annual …, 2010 - dl.acm.org
SIMD organizations amortize the area and power of fetch, decode, and issue logic across
multiple processing units in order to maximize throughput for a given area and power …

Mixing multi-core CPUs and GPUs for scientific simulation software

KA Hawick, A Leist, DP Playne - 2010 - mro.massey.ac.nz
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units …

Data-parallel techniques for simulating a mega-scale agent-based model of systemic inflammatory response syndrome on graphics processing units

S Alberts, MK Keenan, RM D'Souza, G An - Simulation, 2012 - journals.sagepub.com
Agent-based modeling is increasingly being used for computer simulation of complex
biological systems. An agent-based model (ABM) is a bottom-up simulation where the bulk …

Data mining analysis to validate performance tuning practices for HPL

TZ Tan, RSM Goh, V March… - 2009 IEEE international …, 2009 - ieeexplore.ieee.org
Applications performance is a criterion for system evaluation, and hence performance tuning
for these applications is of great interest. One such benchmark application is High …

Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture

D Wu, X Zou, K Dai, J Rao, P Chen, Z Zheng - Journal of Zhejiang …, 2011 - Springer
The fast Fourier transform (FFT) is a fundamental kernel of many computation-intensive
scientific applications. This paper deals with an implementation of the FFT on the accelerator …

Accelerating BLAS on custom architecture through algorithm-architecture co-design

F Merchant, T Vatwani, A Chattopadhyay… - arXiv preprint arXiv …, 2016 - arxiv.org
Basic Linear Algebra Subprograms (BLAS) play key role in high performance and scientific
computing applications. Experimentally, yesteryear multicore and General Purpose …

Leveraging Memory Level Parallelism Using Dynamic Warp Subdivision

J Meng, D Tarjan, K Skadron - … , University of Virginia, Tech. Rep. CS …, 2009 - academia.edu
SIMD organizations have shown to allow high throughput for data-parallel applications.
They can operate on multiple datapaths under the same instruction sequencer, with its set of …

Algorithm/architecture codesign of low power and high performance linear algebra compute fabrics

A Pedram - 2013 IEEE International Symposium on Parallel & …, 2013 - ieeexplore.ieee.org
We show the design of specialized compute fabrics that maintain the efficiency of full custom
hardware while providing enough flexibility to execute a whole class of coarse-grain linear …

A finite domain constraint approach for placement and routing of coarse-grained reconfigurable architectures

R Saraswat - 2010 - search.proquest.com
Scheduling, placement, and routing are important steps in Very Large Scale Integration
(VLSI) design. Researchers have developed numerous techniques to solve placement and …

A chemical reactor benchmark for parallel adaptive control using feedforward neural networks

CO Cajueiro, EM Hemerly - Proceedings. Vol. 1. Sixth Brazilian …, 2000 - ieeexplore.ieee.org
This paper applies a parallel scheme for adaptive control that uses only one neural network
to a CSTR (continuous stirred tank reactor). Convergence of the identification error is …