Profiling a warehouse-scale computer

S Kanev, JP Darago, K Hazelwood… - Proceedings of the …, 2015 - dl.acm.org
With the increasing prevalence of warehouse-scale (WSC) and cloud computing,
understanding the interactions of server applications with the underlying microarchitecture …

Decoupled vector runahead

A Naithani, J Roelandts, S Ainsworth… - Proceedings of the 56th …, 2023 - dl.acm.org
We present Decoupled Vector Runahead (DVR), an in-core prefetching technique,
executing separately to the main application thread, that exploits massive amounts of …

APPROX-NoC: A data approximation framework for network-on-chip architectures

R Boyapati, J Huang, P Majumder, KH Yum… - Proceedings of the 44th …, 2017 - dl.acm.org
The trend of unsustainable power consumption and large memory bandwidth demands in
massively parallel multicore systems, with the advent of the big data era, has brought upon …

An event-triggered programmable prefetcher for irregular workloads

S Ainsworth, TM Jones - ACM SIGPLAN Notices, 2018 - dl.acm.org
Many modern workloads compute on large amounts of data, often with irregular memory
accesses. Current architectures perform poorly for these workloads, as existing prefetching …

Beyond the wall: Near-data processing for databases

SL Xi, A Augusta, M Athanassoulis… - Proceedings of the 11th …, 2015 - dl.acm.org
The continuous growth of main memory size allows modern data systems to process entire
large scale datasets in memory. The increase in memory capacity, however, is not matched …

Graph prefetching using data structure knowledge

S Ainsworth, TM Jones - … of the 2016 International Conference on …, 2016 - dl.acm.org
Searches on large graphs are heavily memory latency bound, as a result of many high
latency DRAM accesses. Due to the highly irregular nature of the access patterns involved …

Software prefetching for indirect memory accesses

S Ainsworth, TM Jones - 2017 IEEE/ACM International …, 2017 - ieeexplore.ieee.org
Many modern data processing and HPC workloads are heavily memory-latency bound. A
tempting proposition to solve this is software prefetching, where special non-blocking loads …

Pipette: Improving core utilization on irregular applications through intra-core pipeline parallelism

QM Nguyen, D Sanchez - 2020 53rd Annual IEEE/ACM …, 2020 - ieeexplore.ieee.org
Applications with irregular memory accesses and control flow, such as graph algorithms and
sparse linear algebra, use high-performance cores very poorly and suffer from dismal IPC …

SpZip: Architectural support for effective data compression in irregular applications

Y Yang, JS Emer, D Sanchez - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
Irregular applications, such as graph analytics and sparse linear algebra, exhibit frequent
indirect, data-dependent accesses to single or short sequences of elements that cause high …

LogCA: A high-level performance model for hardware accelerators

MSB Altaf, DA Wood - ACM SIGARCH Computer Architecture News, 2017 - dl.acm.org
With the end of Dennard scaling, architects have increasingly turned to special-purpose
hardware accelerators to improve the performance and energy efficiency for some …