LegoOS: A disseminated, distributed OS for hardware resource disaggregation

Y Shan, Y Huang, Y Chen, Y Zhang - 13th USENIX Symposium on …, 2018 - usenix.org
The monolithic server model, where a server is the unit of deployment, operation, and failure,
is meeting its limits in the face of several recent hardware and application trends. To improve …

Syncron: Efficient synchronization support for near-data-processing architectures

C Giannoula, N Vijaykumar… - … Symposium on High …, 2021 - ieeexplore.ieee.org
Near-Data-Processing (NDP) architectures present a promising way to alleviate data
movement costs and can provide significant performance and energy benefits to parallel …

Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces

B Pichai, L Hsu, A Bhattacharjee - ACM SIGARCH Computer Architecture …, 2014 - dl.acm.org
The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent
example, necessitates a manageable programming model to ensure widespread adoption …

A survey of techniques for architecting TLBs

S Mittal - Concurrency and computation: practice and …, 2017 - Wiley Online Library
The translation lookaside buffer (TLB) caches virtual-to-physical address translation information
and is used in systems ranging from embedded devices to high-end servers. Because TLB …

Write-light cache for energy harvesting systems

J Choi, J Zeng, D Lee, C Min, C Jung - Proceedings of the 50th Annual …, 2023 - dl.acm.org
Energy harvesting systems have huge potential to enable battery-less Internet of Things (IoT)
services. However, they have been designed without a cache due to the difficulty of crash …

Mosaic pages: Big TLB reach with small pages

K Gosakan, J Han, W Kuszmaul, IN Mubarek… - Proceedings of the 28th …, 2023 - dl.acm.org
The TLB is increasingly a bottleneck for big data applications. In most designs, the number
of TLB entries is highly constrained by latency requirements, and growing much more …

Border control: Sandboxing accelerators

LE Olson, J Power, MD Hill, DA Wood - Proceedings of the 48th …, 2015 - dl.acm.org
As hardware accelerators proliferate, there is a desire to logically integrate them more tightly
with CPUs through interfaces such as shared virtual memory. Although this integration has …

Selective GPU caches to eliminate CPU-GPU HW cache coherence

N Agarwal, D Nellans, E Ebrahimi… - … Symposium on High …, 2016 - ieeexplore.ieee.org
Cache coherence is ubiquitous in shared memory multiprocessors because it provides a
simple, high performance memory abstraction to programmers. Recent work suggests …

Victima: Drastically increasing address translation reach by leveraging underutilized cache resources

K Kanellopoulos, HC Nam, N Bostanci, R Bera… - Proceedings of the 56th …, 2023 - dl.acm.org
Address translation is a performance bottleneck in data-intensive workloads due to large
datasets and irregular access patterns that lead to frequent high-latency page table walks …

Turning centralized coherence and distributed critical-section execution on their head: A new approach for scalable distributed shared memory

S Kaxiras, D Klaftenegger, M Norgren, A Ros… - Proceedings of the 24th …, 2015 - dl.acm.org
A coherent global address space in a distributed system enables shared memory
programming at a much larger scale than a single multicore or a single SMP. Without …