Understanding GPU power: A survey of profiling, modeling, and simulation methods

RA Bridges, N Imam, TM Mintz - ACM Computing Surveys (CSUR), 2016 - dl.acm.org
Modern graphics processing units (GPUs) have complex architectures that admit
exceptional performance and energy efficiency for high-throughput applications. Although …

End-to-end deep learning of optimization heuristics

C Cummins, P Petoumenos, Z Wang… - 2017 26th …, 2017 - ieeexplore.ieee.org
Accurate automatic optimization heuristics are necessary for dealing with the complexity and
diversity of modern hardware and software. Machine learning is a proven technique for …

A survey on agent-based simulation using hardware accelerators

J Xiao, P Andelfinger, D Eckhoff, W Cai… - ACM Computing Surveys …, 2019 - dl.acm.org
Due to decelerating gains in single-core CPU performance, computationally expensive
simulations are increasingly executed on highly parallel hardware platforms. Agent-based …

GPGPU performance and power estimation using machine learning

G Wu, JL Greathouse, A Lyashevsky… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) have numerous configuration and design options,
including core frequency, number of parallel compute units (CUs), and available memory …

A simplified and accurate model of power-performance efficiency on emergent GPU architectures

S Song, C Su, B Rountree… - 2013 IEEE 27th …, 2013 - ieeexplore.ieee.org
Emergent heterogeneous systems must be optimized for both power and performance at
exascale. Massive parallelism combined with complex memory hierarchies form a barrier to …

Demystifying TensorRT: Characterizing neural network inference engine on NVIDIA edge devices

O Shafi, C Rai, R Sen… - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Edge devices are seeing tremendous growth in sensing and computational capabilities.
Running state-of-the-art deep neural network (NN) based data processing on multi-core …

Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling

J Zhong, B He - IEEE Transactions on Parallel and Distributed …, 2013 - ieeexplore.ieee.org
Graphics processors, or GPUs, have recently been widely used as accelerators in shared
environments such as clusters and clouds. In such shared environments, many kernels are …

[BOOK][B] Understanding latency hiding on GPUs

V Volkov - 2016 - search.proquest.com
Modern commodity processors such as GPUs may execute up to about a thousand
physical threads per chip to better utilize their numerous execution units and hide execution …

Automated SmartNIC offloading insights for network functions

Y Qiu, J Xing, KF Hsu, Q Kang, M Liu… - Proceedings of the …, 2021 - dl.acm.org
The gap between CPU and networking speeds has motivated the development of
SmartNICs for NF (network functions) offloading. However, offloading performance is …

A performance analysis framework for optimizing OpenCL applications on FPGAs

Z Wang, B He, W Zhang, S Jiang - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Recently, FPGA vendors such as Altera and Xilinx have released OpenCL SDK for
programming FPGAs. However, the architecture of FPGA is significantly different from that of …