Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions

Y Wang, C Li, C Liu, S Liu, Y Lei, J Zhang… - CCF Transactions on …, 2021 - Springer
Digital Signal Processors (DSPs) have been widely used in embedded domains,
delivering high performance with ultra-low power consumption. Such promises make it …

Stash: Have your scratchpad and cache it too

R Komuravelli, MD Sinclair, J Alsop, M Huzaifa… - ACM SIGARCH …, 2015 - dl.acm.org
Heterogeneous systems employ specialization for energy efficiency. Since data movement
is expected to be a dominant consumer of energy, these systems employ specialized …

ApproxHPVM: a portable compiler IR for accuracy-aware optimizations

H Sharif, P Srivastava, M Huzaifa… - Proceedings of the …, 2019 - dl.acm.org
We propose ApproxHPVM, a compiler IR and system designed to enable accuracy-aware
performance and energy tuning on heterogeneous systems with multiple compute units and …

Toward cache-friendly hardware accelerators

YS Shao, S Xi, V Srinivasan, GY Wei… - HPCA Sensors and Cloud …, 2015 - samxi.org
Increasing demand for power-efficient, high-performance computing has spurred a growing
number and diversity of hardware accelerators in mobile Systems on Chip (SoCs) as well as …

A novel DSP architecture for scientific computing and deep learning

C Yang, S Chen, J Zhang, Z Lv, Z Wang - IEEE Access, 2019 - ieeexplore.ieee.org
Exascale computing requires accelerators with ultrahigh power efficiency. Digital signal
processors (DSPs), the most important embedded processors widely known for high power …

WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization

NC Crago, S Damani, K Sankaralingam… - … Symposium on High …, 2024 - ieeexplore.ieee.org
Graphics processing units (GPUs) are an important class of parallel processors that offer
high compute throughput and memory bandwidth. GPUs are used in a variety of important …

Coordinated DMA: improving the DRAM access efficiency for matrix multiplication

S Ma, Z Liu, S Chen, L Huang, Y Guo… - … on Parallel and …, 2019 - ieeexplore.ieee.org
High performance implementation of matrix multiplication is essential for scientific
computing. The memory access procedure is quite likely to be the bottleneck of matrix …

An efficient direct memory access (DMA) controller for scientific computing accelerators

S Ma, L Huang, Y Lei, Y Guo… - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
We design an efficient DMA controller for scientific computing accelerators. It supports
several flexible and powerful transfers, including reshape transfers, parameter linking …

ELF: Maximizing memory-level parallelism for GPUs with coordinated warp and fetch scheduling

JJK Park, Y Park, S Mahlke - … of the International Conference for High …, 2015 - dl.acm.org
Graphics processing units (GPUs) are increasingly utilized as throughput engines in the
modern computer systems. GPUs rely on fast context switching between thousands of …

CIAO: Cache interference-aware throughput-oriented architecture and scheduling for GPUs

J Zhang, S Gao, NS Kim, M Jung - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
A modern GPU aims to simultaneously execute more warps for higher Thread-Level
Parallelism (TLP) and performance. When generating many memory requests, however …