TY - GEN
T1 - Reactive NUCA: Near-optimal block placement and replication in distributed caches
T2 - ISCA 2009 - 36th Annual International Symposium on Computer Architecture
AU - Hardavellas, Nikos
AU - Ferdman, Michael
AU - Falsafi, Babak
AU - Ailamaki, Anastasia
PY - 2009
Y1 - 2009
N2 - Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multi-programmed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
KW - Design
KW - Experimentation
KW - Performance
UR - http://www.scopus.com/inward/record.url?scp=70350601187&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70350601187&partnerID=8YFLogxK
U2 - 10.1145/1555754.1555779
DO - 10.1145/1555754.1555779
M3 - Conference contribution
AN - SCOPUS:70350601187
SN - 9781605585260
T3 - Proceedings - International Symposium on Computer Architecture
SP - 184
EP - 195
BT - ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings
Y2 - 20 June 2009 through 24 June 2009
ER -