Reactive NUCA: Near-optimal block placement and replication in distributed caches

Nikos Hardavellas*, Michael Ferdman, Babak Falsafi, Anastasia Ailamaki

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

258 Citations (Scopus)

Abstract

Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multi-programmed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.

Original language: English (US)
Title of host publication: ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings
Pages: 184-195
Number of pages: 12
DOIs: https://doi.org/10.1145/1555754.1555779
State: Published - Nov 30 2009
Event: ISCA 2009 - 36th Annual International Symposium on Computer Architecture - Austin, TX, United States
Duration: Jun 20 2009 → Jun 24 2009

Publication series

Name: Proceedings - International Symposium on Computer Architecture
ISSN (Print): 1063-6897

Other

Other: ISCA 2009 - 36th Annual International Symposium on Computer Architecture
Country: United States
City: Austin, TX
Period: 6/20/09 → 6/24/09

Fingerprint

Servers
Communication
Wire
Data storage equipment

Keywords

  • Design
  • Experimentation
  • Performance

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Hardavellas, N., Ferdman, M., Falsafi, B., & Ailamaki, A. (2009). Reactive NUCA: Near-optimal block placement and replication in distributed caches. In ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings (pp. 184-195). (Proceedings - International Symposium on Computer Architecture). https://doi.org/10.1145/1555754.1555779
Hardavellas, Nikos; Ferdman, Michael; Falsafi, Babak; Ailamaki, Anastasia. / Reactive NUCA: Near-optimal block placement and replication in distributed caches. ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings. 2009. pp. 184-195 (Proceedings - International Symposium on Computer Architecture).
@inproceedings{1e53a693352a49d5a28ee9529b5608db,
title = "Reactive NUCA: Near-optimal block placement and replication in distributed caches",
abstract = "Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multi-programmed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14{\%} on average over competing designs and by 32{\%} at best, while achieving performance within 5{\%} of an ideal cache design.",
keywords = "Design, Experimentation, Performance",
author = "Nikos Hardavellas and Michael Ferdman and Babak Falsafi and Anastasia Ailamaki",
year = "2009",
month = "11",
day = "30",
doi = "10.1145/1555754.1555779",
language = "English (US)",
isbn = "9781605585260",
series = "Proceedings - International Symposium on Computer Architecture",
pages = "184--195",
booktitle = "ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings",
}

Hardavellas, N, Ferdman, M, Falsafi, B & Ailamaki, A 2009, Reactive NUCA: Near-optimal block placement and replication in distributed caches. in ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings. Proceedings - International Symposium on Computer Architecture, pp. 184-195, ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Austin, TX, United States, 6/20/09. https://doi.org/10.1145/1555754.1555779

Reactive NUCA: Near-optimal block placement and replication in distributed caches. / Hardavellas, Nikos; Ferdman, Michael; Falsafi, Babak; Ailamaki, Anastasia.

ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings. 2009. p. 184-195 (Proceedings - International Symposium on Computer Architecture).

TY - GEN
T1 - Reactive NUCA
T2 - Near-optimal block placement and replication in distributed caches
AU - Hardavellas, Nikos
AU - Ferdman, Michael
AU - Falsafi, Babak
AU - Ailamaki, Anastasia
PY - 2009/11/30
Y1 - 2009/11/30
AB - Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multi-programmed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.

KW - Design
KW - Experimentation
KW - Performance
UR - http://www.scopus.com/inward/record.url?scp=70350601187&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70350601187&partnerID=8YFLogxK
U2 - 10.1145/1555754.1555779
DO - 10.1145/1555754.1555779
M3 - Conference contribution
AN - SCOPUS:70350601187
SN - 9781605585260
T3 - Proceedings - International Symposium on Computer Architecture
SP - 184
EP - 195
BT - ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings
ER -

Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In ISCA 2009 - 36th Annual International Symposium on Computer Architecture, Conference Proceedings. 2009. p. 184-195. (Proceedings - International Symposium on Computer Architecture). https://doi.org/10.1145/1555754.1555779