Scheduling threads for constructive cache sharing on CMPs

Shimin Chen*, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C. Mowry, Chris Wilkerson

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

105 Citations (Scopus)

Abstract

In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3 - 1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

Original languageEnglish (US)
Title of host publicationSPAA'07
Subtitle of host publicationProceedings of the Nineteenth Annual Symposium on Parallelism in Algorithms and Architectures
Pages105-115
Number of pages11
DOIs
StatePublished - Oct 18 2007
EventSPAA'07: 19th Annual Symposium on Parallelism in Algorithms and Architectures - San Diego, CA, United States
Duration: Jun 9 2007Jun 11 2007

Publication series

NameAnnual ACM Symposium on Parallelism in Algorithms and Architectures

Other

OtherSPAA'07: 19th Annual Symposium on Parallelism in Algorithms and Architectures
CountryUnited States
CitySan Diego, CA
Period6/9/076/11/07

Fingerprint

Scheduling
Bandwidth

Keywords

  • Chip multiprocessors
  • Constructive cache sharing
  • Parallel depth first
  • Scheduling algorithms
  • Thread granularity
  • Work stealing
  • Working set profiling

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Chen, S., Gibbons, P. B., Kozuch, M., Liaskovitis, V., Ailamaki, A., Blelloch, G. E., ... Wilkerson, C. (2007). Scheduling threads for constructive cache sharing on CMPs. In SPAA'07: Proceedings of the Nineteenth Annual Symposium on Parallelism in Algorithms and Architectures (pp. 105-115). (Annual ACM Symposium on Parallelism in Algorithms and Architectures). https://doi.org/10.1145/1248377.1248396
Chen, Shimin ; Gibbons, Phillip B. ; Kozuch, Michael ; Liaskovitis, Vasileios ; Ailamaki, Anastassia ; Blelloch, Guy E. ; Falsafi, Babak ; Fix, Limor ; Hardavellas, Nikos ; Mowry, Todd C. ; Wilkerson, Chris. / Scheduling threads for constructive cache sharing on CMPs. SPAA'07: Proceedings of the Nineteenth Annual Symposium on Parallelism in Algorithms and Architectures. 2007. pp. 105-115 (Annual ACM Symposium on Parallelism in Algorithms and Architectures).
@inproceedings{77bb644735764392b926b986972b8146,
title = "Scheduling threads for constructive cache sharing on CMPs",
abstract = "In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3 - 1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.",
keywords = "Chip multiprocessors, Constructive cache sharing, Parallel depth first, Scheduling algorithms, Thread granularity, Work stealing, Working set profiling",
author = "Shimin Chen and Gibbons, {Phillip B.} and Michael Kozuch and Vasileios Liaskovitis and Anastassia Ailamaki and Blelloch, {Guy E.} and Babak Falsafi and Limor Fix and Nikos Hardavellas and Mowry, {Todd C.} and Chris Wilkerson",
year = "2007",
month = "10",
day = "18",
doi = "10.1145/1248377.1248396",
language = "English (US)",
isbn = "159593667X",
series = "Annual ACM Symposium on Parallelism in Algorithms and Architectures",
pages = "105--115",
booktitle = "SPAA'07",

}

Chen, S, Gibbons, PB, Kozuch, M, Liaskovitis, V, Ailamaki, A, Blelloch, GE, Falsafi, B, Fix, L, Hardavellas, N, Mowry, TC & Wilkerson, C 2007, Scheduling threads for constructive cache sharing on CMPs. in SPAA'07: Proceedings of the Nineteenth Annual Symposium on Parallelism in Algorithms and Architectures. Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 105-115, SPAA'07: 19th Annual Symposium on Parallelism in Algorithms and Architectures, San Diego, CA, United States, 6/9/07. https://doi.org/10.1145/1248377.1248396

Scheduling threads for constructive cache sharing on CMPs. / Chen, Shimin; Gibbons, Phillip B.; Kozuch, Michael; Liaskovitis, Vasileios; Ailamaki, Anastassia; Blelloch, Guy E.; Falsafi, Babak; Fix, Limor; Hardavellas, Nikos; Mowry, Todd C.; Wilkerson, Chris.

SPAA'07: Proceedings of the Nineteenth Annual Symposium on Parallelism in Algorithms and Architectures. 2007. p. 105-115 (Annual ACM Symposium on Parallelism in Algorithms and Architectures).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - Scheduling threads for constructive cache sharing on CMPs

AU - Chen, Shimin

AU - Gibbons, Phillip B.

AU - Kozuch, Michael

AU - Liaskovitis, Vasileios

AU - Ailamaki, Anastassia

AU - Blelloch, Guy E.

AU - Falsafi, Babak

AU - Fix, Limor

AU - Hardavellas, Nikos

AU - Mowry, Todd C.

AU - Wilkerson, Chris

PY - 2007/10/18

Y1 - 2007/10/18

N2 - In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3 - 1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

AB - In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3 - 1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

KW - Chip multiprocessors

KW - Constructive cache sharing

KW - Parallel depth first

KW - Scheduling algorithms

KW - Thread granularity

KW - Work stealing

KW - Working set profiling

UR - http://www.scopus.com/inward/record.url?scp=35248852476&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35248852476&partnerID=8YFLogxK

U2 - 10.1145/1248377.1248396

DO - 10.1145/1248377.1248396

M3 - Conference contribution

AN - SCOPUS:35248852476

SN - 159593667X

SN - 9781595936677

T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures

SP - 105

EP - 115

BT - SPAA'07

ER -

Chen S, Gibbons PB, Kozuch M, Liaskovitis V, Ailamaki A, Blelloch GE et al. Scheduling threads for constructive cache sharing on CMPs. In SPAA'07: Proceedings of the Nineteenth Annual Symposium on Parallelism in Algorithms and Architectures. 2007. p. 105-115. (Annual ACM Symposium on Parallelism in Algorithms and Architectures). https://doi.org/10.1145/1248377.1248396