Massively parallel processing core with plural chains of processing elements and respective smart memory storing select data received from each chain

S Cadambi, A Majumdar, M Becchi… - US Patent …, 2013 - Google Patents
An accelerator system is shown that includes a plurality of processing cores. Each
processing core includes a plurality of processing chains configured to perform parallel …

Why on-chip cache coherence is here to stay

MMK Martin, MD Hill, DJ Sorin - Communications of the ACM, 2012 - dl.acm.org
Communications of the ACM, July 2012, vol. 55, no. 7, page 78 (contributed articles).
Shared memory is the dominant low-level communication …

Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

GF Diamos, AR Kerr, S Yalamanchili… - Proceedings of the 19th …, 2010 - dl.acm.org
Ocelot is a dynamic compilation framework designed to map the explicitly data parallel
execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms …

An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

DH Woo, NH Seong, DL Lewis… - HPCA-16 2010 The …, 2010 - ieeexplore.ieee.org
Memory bandwidth has become a major performance bottleneck as more and more cores
are integrated onto a single die, demanding more and more data from the system memory …

Relax: An architectural framework for software recovery of hardware faults

M De Kruijf, S Nomura, K Sankaralingam - ACM SIGARCH Computer …, 2010 - dl.acm.org
As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …

DeNovo: Rethinking the memory hierarchy for disciplined parallelism

B Choi, R Komuravelli, H Sung… - 2011 International …, 2011 - ieeexplore.ieee.org
For parallelism to become tractable for mass programmers, shared-memory languages and
environments must evolve to enforce disciplined practices that ban "wild shared-memory …

Thread block compaction for efficient SIMT control flow

WWL Fung, TM Aamodt - 2011 IEEE 17th international …, 2011 - ieeexplore.ieee.org
Manycore accelerators such as graphics processor units (GPUs) organize processing units
into single-instruction, multiple-data "cores" to improve throughput per unit hardware cost …

Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces

B Pichai, L Hsu, A Bhattacharjee - ACM SIGARCH Computer Architecture …, 2014 - dl.acm.org
The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent
example, necessitates a manageable programming model to ensure widespread adoption …

An asymmetric distributed shared memory model for heterogeneous parallel systems

I Gelado, JE Stone, J Cabezas, S Patel… - Proceedings of the …, 2010 - dl.acm.org
Heterogeneous computing combines general purpose CPUs with accelerators to efficiently
execute both sequential control-intensive and data-parallel phases of applications. Existing …

Goldmine: Automatic assertion generation using data mining and static analysis

S Vasudevan, D Sheridan, S Patel… - … , Automation & Test …, 2010 - ieeexplore.ieee.org
We present GOLDMINE, a methodology for generating assertions automatically. Our method
involves a combination of data mining and static analysis of the Register Transfer Level …