Improving data cache performance by pre-executing instructions under a cache miss

C Ding, S Liao, Y Wang, Z Li, N Liu, Y Zhuo… - Proceedings of the 50th …, 2017 - dl.acm.org

Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the
size of DNNs continues to grow, it is critical to improve the energy efficiency and …

Cited by 346

DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks

GF Oliveira, J Gómez-Luna, L Orosa, S Ghose… - IEEE …, 2021 - ieeexplore.ieee.org

Data movement between the CPU and main memory is a first-order obstacle against
improving performance, scalability, and energy efficiency in modern systems. Computer systems …

Cited by 107

Pythia: A customizable hardware prefetching framework using online reinforcement learning

R Bera, K Kanellopoulos, A Nori, T Shahroodi… - MICRO-54: 54th Annual …, 2021 - dl.acm.org

Past research has proposed numerous hardware prefetching techniques, most of which rely
on exploiting one specific type of program context information (e.g., program counter …

Cited by 89

A case for exploiting subarray-level parallelism (SALP) in DRAM

Y Kim, V Seshadri, D Lee, J Liu, O Mutlu - ACM SIGARCH Computer …, 2012 - dl.acm.org

Modern DRAMs have multiple banks to serve multiple memory requests in parallel.
However, when two requests go to the same bank, they have to be served serially …

Cited by 484

EDEN: Enabling energy-efficient, high-performance deep neural network inference using approximate DRAM

S Koppula, L Orosa, AG Yağlıkçı, R Azizi… - Proceedings of the …, 2019 - dl.acm.org

The effectiveness of deep neural networks (DNN) in vision, speech, and language
processing has prompted a tremendous demand for energy-efficient high-performance DNN …

Cited by 150

Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems

O Mutlu, T Moscibroda - ACM SIGARCH Computer Architecture News, 2008 - dl.acm.org

In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a
shared DRAM system, requests from a thread can not only delay requests from other threads …

Cited by 761

Locality exists in graph processing: Workload characterization on an ivy bridge server

S Beamer, K Asanovic… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org

Graph processing is an increasingly important application domain and is typically
communication-bound. In this work, we analyze the performance characteristics of three …

Cited by 234

Runahead execution: An alternative to very large instruction windows for out-of-order processors

O Mutlu, J Stark, C Wilkerson… - The Ninth International …, 2003 - ieeexplore.ieee.org

Today's high performance processors tolerate long latency operations by means of out-of-
order execution. However, as latencies increase, the size of the instruction window must …

Cited by 630

Research problems and opportunities in memory systems

O Mutlu, L Subramanian - Supercomputing frontiers and …, 2014 - superfri.susu.ru

The memory system is a fundamental performance and energy bottleneck in almost all
computing systems. Recent system design, application, and technology trends that require …

Cited by 249

A case for MLP-aware cache replacement

MK Qureshi, DN Lynch, O Mutlu, YN Patt - ACM SIGARCH Computer …, 2006 - dl.acm.org

Performance loss due to long-latency memory accesses can be reduced by servicing
multiple memory accesses concurrently. The notion of generating and servicing long-latency …

Cited by 430
