Rigel: An architecture and scalable programming interface for a 1000-core accelerator

Massively parallel processing core with plural chains of processing elements and respective smart memory storing select data received from each chain

S Cadambi, A Majumdar, M Becchi… - US Patent …, 2013 - Google Patents

An accelerator system is shown that includes a plurality of processing cores. Each
processing core includes a plurality of processing chains configured to perform parallel …

Cited by 471 Related articles All 4 versions

[PDF] acm.org

Why on-chip cache coherence is here to stay

MMK Martin, MD Hill, DJ Sorin - Communications of the ACM, 2012 - dl.acm.org

Communications of the ACM, July 2012, Vol. 55, No. 7, contributed articles.
Shared memory is the dominant low-level communication …

Cited by 364 Related articles All 21 versions

[PDF] gatech.edu

Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

GF Diamos, AR Kerr, S Yalamanchili… - Proceedings of the 19th …, 2010 - dl.acm.org

Ocelot is a dynamic compilation framework designed to map the explicitly data parallel
execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms …

Cited by 352 Related articles All 6 versions

[PDF] github.io

An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

DH Woo, NH Seong, DL Lewis… - HPCA-16 2010 The …, 2010 - ieeexplore.ieee.org

Memory bandwidth has become a major performance bottleneck as more and more cores
are integrated onto a single die, demanding more and more data from the system memory …

Cited by 359 Related articles All 7 versions

[PDF] llvm.org

Relax: An architectural framework for software recovery of hardware faults

M De Kruijf, S Nomura, K Sankaralingam - ACM SIGARCH Computer …, 2010 - dl.acm.org

As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …

Cited by 307 Related articles All 11 versions

[PDF] stonybrook.edu

DeNovo: Rethinking the memory hierarchy for disciplined parallelism

B Choi, R Komuravelli, H Sung… - 2011 International …, 2011 - ieeexplore.ieee.org

For parallelism to become tractable for mass programmers, shared-memory languages and
environments must evolve to enforce disciplined practices that ban "wild shared-memory …

Cited by 263 Related articles All 12 versions

[PDF] ubc.ca

Thread block compaction for efficient SIMT control flow

WWL Fung, TM Aamodt - 2011 IEEE 17th international …, 2011 - ieeexplore.ieee.org

Manycore accelerators such as graphics processor units (GPUs) organize processing units
into single-instruction, multiple data “cores” to improve throughput per unit hardware cost …

Cited by 276 Related articles All 10 versions

[PDF] illinois.edu

Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces

B Pichai, L Hsu, A Bhattacharjee - ACM SIGARCH Computer Architecture …, 2014 - dl.acm.org

The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent
example, necessitates a manageable programming model to ensure widespread adoption …

Cited by 205 Related articles All 13 versions

[PDF] upc.edu

An asymmetric distributed shared memory model for heterogeneous parallel systems

I Gelado, JE Stone, J Cabezas, S Patel… - Proceedings of the …, 2010 - dl.acm.org

Heterogeneous computing combines general purpose CPUs with accelerators to efficiently
execute both sequential control-intensive and data-parallel phases of applications. Existing …

Cited by 269 Related articles All 17 versions

[PDF] psu.edu

Goldmine: Automatic assertion generation using data mining and static analysis

S Vasudevan, D Sheridan, S Patel… - … , Automation & Test …, 2010 - ieeexplore.ieee.org

We present GOLDMINE, a methodology for generating assertions automatically. Our method
involves a combination of data mining and static analysis of the Register Transfer Level …

Cited by 201 Related articles All 12 versions
