Efficiently enabling conventional block sizes for very large die-stacked DRAM caches

GH Loh, MD Hill - Proceedings of the 44th Annual IEEE/ACM …, 2011 - dl.acm.org
Die-stacking technology enables multiple layers of DRAM to be integrated with multicore processors. A promising use of stacked DRAM is as a cache, since its capacity is insufficient to be all of main memory (for all but some embedded systems). However, a 1GB DRAM cache with 64-byte blocks requires 96MB of tag storage. Placing these tags on-chip is impractical (larger than on-chip L3s), while putting them in DRAM is slow (two full DRAM accesses for tag and data). Larger blocks and sub-blocking are possible, but less robust due to fragmentation.
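The 96MB figure follows directly from the cache geometry. A quick sketch of the arithmetic (the 48-bit per-block tag-plus-metadata size is an assumption consistent with the abstract's numbers, not a value stated here):

```python
# Tag-storage overhead for a 1GB DRAM cache with 64-byte blocks.
CACHE_BYTES = 1 << 30          # 1GB cache capacity
BLOCK_BYTES = 64               # conventional block size
TAG_ENTRY_BITS = 48            # assumed tag + metadata bits per block

num_blocks = CACHE_BYTES // BLOCK_BYTES        # 16,777,216 blocks
tag_bytes = num_blocks * TAG_ENTRY_BITS // 8   # total tag storage

print(num_blocks)                # 16777216
print(tag_bytes // (1 << 20))    # 96 (MB of tag storage)
```

At 6 bytes per block, the tag array alone exceeds typical on-chip L3 capacities, which is why the paper pushes tags into the stacked DRAM and optimizes the resulting access path.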
This work efficiently enables conventional block sizes for very large die-stacked DRAM caches with two innovations. First, we make hits faster than just storing tags in stacked DRAM by scheduling the tag and data accesses as a compound access so the data access is always a row buffer hit. Second, we make misses faster with a MissMap that eschews stacked-DRAM access on all misses. Like extreme sub-blocking, our implementation of the MissMap stores a vector of block-valid bits for each "page" in the DRAM cache. Unlike conventional sub-blocking, the MissMap (a) points to many more pages than can be stored in the DRAM cache (making the effects of fragmentation rare) and (b) does not point to the "way" that holds a block (but defers to the off-chip tags).
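The MissMap described above can be sketched as a structure mapping page numbers to vectors of block-valid bits: a clear bit guarantees the block is absent, so the stacked-DRAM probe can be skipped entirely on a miss. This is an illustrative model only; the page size, the dictionary-based storage, and the method names are assumptions, and the real design bounds the number of tracked pages in an SRAM structure rather than growing without limit:

```python
class MissMap:
    """Illustrative sketch of a MissMap: per-page vectors of
    block-valid bits that track presence only. Note it never
    records *which way* holds a block; that is left to the
    tags stored in the stacked DRAM."""

    def __init__(self, page_bytes=4096, block_bytes=64):
        self.page_bytes = page_bytes
        self.block_bytes = block_bytes
        # page number -> bit vector (int), one bit per block in the page
        self.entries = {}

    def _locate(self, addr):
        page, offset = divmod(addr, self.page_bytes)
        return page, offset // self.block_bytes

    def mark_installed(self, addr):
        """Set the valid bit when a block is filled into the cache."""
        page, blk = self._locate(addr)
        self.entries[page] = self.entries.get(page, 0) | (1 << blk)

    def mark_evicted(self, addr):
        """Clear the valid bit when a block is evicted."""
        page, blk = self._locate(addr)
        if page in self.entries:
            self.entries[page] &= ~(1 << blk)
            if self.entries[page] == 0:
                del self.entries[page]

    def may_hit(self, addr):
        """False means definitely absent: skip the stacked-DRAM
        tag access and go straight to off-chip main memory."""
        page, blk = self._locate(addr)
        return bool((self.entries.get(page, 0) >> blk) & 1)
```

Because an entry only records presence bits for a whole page, the map can cover far more pages than the cache can hold, which is what makes fragmentation effects rare relative to conventional sub-blocking.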
For the evaluated large-footprint commercial workloads, the proposed cache organization delivers 92.9% of the performance benefit of an ideal 1GB DRAM cache with an impractical 96MB on-chip SRAM tag array.