D2MA: Accelerating coarse-grained data transfer for GPUs

Y Wang, C Li, C Liu, S Liu, Y Lei, J Zhang… - CCF Transactions on …, 2021 - Springer

Digital Signal Processors (DSPs) have been widely used in embedded domains,
delivering high performance with ultra-low power consumption. Such promises make it …

Cited by 21 Related articles

[PDF] psu.edu

Stash: Have your scratchpad and cache it too

R Komuravelli, MD Sinclair, J Alsop, M Huzaifa… - ACM SIGARCH …, 2015 - dl.acm.org

Heterogeneous systems employ specialization for energy efficiency. Since data movement
is expected to be a dominant consumer of energy, these systems employ specialized …

Cited by 101 Related articles All 11 versions

[PDF] acm.org

ApproxHPVM: a portable compiler IR for accuracy-aware optimizations

H Sharif, P Srivastava, M Huzaifa… - Proceedings of the …, 2019 - dl.acm.org

We propose ApproxHPVM, a compiler IR and system designed to enable accuracy-aware
performance and energy tuning on heterogeneous systems with multiple compute units and …

Cited by 28 Related articles All 13 versions

[PDF] samxi.org

[PDF] Toward cache-friendly hardware accelerators

YS Shao, S Xi, V Srinivasan, GY Wei… - HPCA Sensors and Cloud …, 2015 - samxi.org

Increasing demand for power-efficient, high-performance computing has spurred a growing
number and diversity of hardware accelerators in mobile Systems on Chip (SoCs) as well as …

Cited by 36 Related articles All 5 versions

[PDF] ieee.org

A novel DSP architecture for scientific computing and deep learning

C Yang, S Chen, J Zhang, Z Lv, Z Wang - IEEE Access, 2019 - ieeexplore.ieee.org

Exascale computing requires accelerators with ultrahigh power efficiency. Digital signal
processors (DSPs), the most important embedded processors widely known for high power …

Cited by 20 Related articles All 2 versions

[PDF] nealcrago.com

WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization

NC Crago, S Damani, K Sankaralingam… - … Symposium on High …, 2024 - ieeexplore.ieee.org

Graphics processing units (GPUs) are an important class of parallel processors that offer
high compute throughput and memory bandwidth. GPUs are used in a variety of important …

Coordinated DMA: improving the DRAM access efficiency for matrix multiplication

S Ma, Z Liu, S Chen, L Huang, Y Guo… - … on Parallel and …, 2019 - ieeexplore.ieee.org

High performance implementation of matrix multiplication is essential for scientific
computing. The memory access procedure is quite possible to be the bottleneck of matrix …

Cited by 13 Related articles All 3 versions

[PDF] google.com

An efficient direct memory access (DMA) controller for scientific computing accelerators

S Ma, L Huang, Y Lei, Y Guo… - 2019 IEEE International …, 2019 - ieeexplore.ieee.org

We design an efficient DMA controller for scientific computing accelerators. It supports
several flexible and powerful transfers, including reshape transfers, parameter linking …

Cited by 9 Related articles All 2 versions

[PDF] acm.org

ELF: Maximizing memory-level parallelism for GPUs with coordinated warp and fetch scheduling

JJK Park, Y Park, S Mahlke - … of the International Conference for High …, 2015 - dl.acm.org

Graphics processing units (GPUs) are increasingly utilized as throughput engines in the
modern computer systems. GPUs rely on fast context switching between thousands of …

Cited by 15 Related articles All 8 versions

[PDF] arxiv.org

CIAO: Cache interference-aware throughput-oriented architecture and scheduling for GPUs

J Zhang, S Gao, NS Kim, M Jung - 2018 IEEE International …, 2018 - ieeexplore.ieee.org

A modern GPU aims to simultaneously execute more warps for higher Thread-Level
Parallelism (TLP) and performance. When generating many memory requests, however …

Cited by 10 Related articles All 10 versions
