CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices
Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the
size of DNNs continues to grow, it is critical to improve the energy efficiency and …
DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks
Data movement between the CPU and main memory is a first-order obstacle against improving
performance, scalability, and energy efficiency in modern systems. Computer systems …
Pythia: A customizable hardware prefetching framework using online reinforcement learning
Past research has proposed numerous hardware prefetching techniques, most of which rely
on exploiting one specific type of program context information (e.g., program counter …
A case for exploiting subarray-level parallelism (SALP) in DRAM
Modern DRAMs have multiple banks to serve multiple memory requests in parallel.
However, when two requests go to the same bank, they have to be served serially …
EDEN: Enabling energy-efficient, high-performance deep neural network inference using approximate DRAM
The effectiveness of deep neural networks (DNN) in vision, speech, and language
processing has prompted a tremendous demand for energy-efficient high-performance DNN …
Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems
O Mutlu, T Moscibroda - ACM SIGARCH Computer Architecture News, 2008 - dl.acm.org
In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a
shared DRAM system, requests from a thread can not only delay requests from other threads …
Locality exists in graph processing: Workload characterization on an Ivy Bridge server
S Beamer, K Asanovic… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
Graph processing is an increasingly important application domain and is typically
communication-bound. In this work, we analyze the performance characteristics of three …
Runahead execution: An alternative to very large instruction windows for out-of-order processors
O Mutlu, J Stark, C Wilkerson… - The Ninth International …, 2003 - ieeexplore.ieee.org
Today's high performance processors tolerate long latency operations by means of out-of-order
execution. However, as latencies increase, the size of the instruction window must …
Research problems and opportunities in memory systems
O Mutlu, L Subramanian - Supercomputing frontiers and …, 2014 - superfri.susu.ru
The memory system is a fundamental performance and energy bottleneck in almost all
computing systems. Recent system design, application, and technology trends that require …
A case for MLP-aware cache replacement
MK Qureshi, DN Lynch, O Mutlu, YN Patt - ACM SIGARCH Computer …, 2006 - dl.acm.org
Performance loss due to long-latency memory accesses can be reduced by servicing
multiple memory accesses concurrently. The notion of generating and servicing long-latency …