Brief announcement: Parallel depth first vs. work stealing schedulers on CMP architectures

Vasileios Liaskovitis*, Shimin Chen, Phillip B. Gibbons, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Michael Kozuch, Todd C. Mowry, Chris Wilkerson

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional approach.

Overview of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single-thread working set [1]. In WS, each processing core maintains a local work queue of ready-to-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds. WS is an attractive scheduling policy because when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working sets.

CMP configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm²), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs, as process technologies decrease from 90 nm to 32 nm.

Summary of findings. We studied a variety of benchmark programs to show the following findings.

For several application classes, PDF enables significant constructive sharing between threads, leading to better utilization of the on-chip caches and reduced off-chip traffic compared to WS. In particular, bandwidth-limited irregular programs and parallel divide-and-conquer programs achieve a relative speedup of 1.3-1.6X over WS, with a 13-41% reduction in off-chip traffic. An example is shown in Figure 1 for parallel merge sort. For each schedule, the number of L2 misses (i.e., the off-chip traffic) is shown on the left and the speedup over running on one core is shown on the right, for 1 to 32 cores. Note that reducing the off-chip traffic has the additional benefit of reducing power consumption. Moreover, PDF's smaller working sets provide opportunities to power down segments of the cache without increasing the running time. Furthermore, when multiple programs are active concurrently, the PDF version is also less of a cache hog, and its smaller working set is more likely to remain in the cache across context switches.

For several other application classes, PDF and WS have roughly the same execution times, either because there is only limited data reuse that can be exploited or because the programs are not limited by off-chip bandwidth. In the latter case, the constructive sharing PDF enables still provides the power and multiprogramming benefits discussed above.

Finally, most parallel benchmarks to date, written for SMPs, use such coarse-grained threading that they cannot exploit the constructive cache behavior inherent in PDF. We find that mechanisms for fine-grained threading of applications are crucial to achieving good performance on CMPs.
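As an illustrative aside (not part of the original announcement), the task-selection orders of the two schedulers described in the abstract can be sketched in a few lines. The class names and the single-threaded simulation below are hypothetical simplifications: real WS deques are concurrent and lock-free, and real PDF maintains priorities incrementally rather than via one global heap. The sketch contrasts only which ready task each policy hands out next.

```python
import heapq
from collections import deque


class WorkStealingScheduler:
    """Toy, single-threaded sketch of Work Stealing (WS): each core owns
    a deque. The owner pushes and pops at the top (depth-first locally);
    an idle core steals from the bottom of the first non-empty queue it
    finds, taking the oldest (typically largest) task."""

    def __init__(self, num_cores):
        self.queues = [deque() for _ in range(num_cores)]

    def spawn(self, core, task):
        self.queues[core].append(task)      # owner pushes to the top

    def next_task(self, core):
        q = self.queues[core]
        if q:
            return q.pop()                  # owner pops from the top
        for victim in self.queues:          # local queue empty: go steal
            if victim:
                return victim.popleft()     # steal from the bottom
        return None


class PDFScheduler:
    """Toy sketch of Parallel Depth First (PDF): a shared priority queue
    hands every core the ready task that the sequential execution would
    have run earliest, so co-scheduled tasks tend to track the sequential
    order and share a working set."""

    def __init__(self):
        self.ready = []                     # min-heap of (seq_rank, task)

    def spawn(self, seq_rank, task):
        heapq.heappush(self.ready, (seq_rank, task))

    def next_task(self, core):
        return heapq.heappop(self.ready)[1] if self.ready else None
```

Under WS, an idle core steals the earliest-spawned task and then works on that subtree independently, which is why the cores tend toward disjoint working sets; under PDF, all cores repeatedly draw the globally earliest ready tasks, which tend to touch overlapping data.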

Original language: English (US)
Title of host publication: SPAA 2006
Subtitle of host publication: 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures
Number of pages: 1
Volume: 2006
State: Published - Oct 16 2006
Event: SPAA 2006: 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures - Cambridge, MA, United States
Duration: Jul 30 2006 - Aug 2 2006

Other

Other: SPAA 2006: 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures
Country: United States
City: Cambridge, MA
Period: 7/30/06 - 8/2/06


Keywords

  • Caches
  • Chip Multiprocessors
  • Scheduling

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Liaskovitis, V., Chen, S., Gibbons, P. B., Ailamaki, A., Blelloch, G. E., Falsafi, B., ... Wilkerson, C. (2006). Brief announcement: Parallel depth first vs. work stealing schedulers on CMP architectures. In SPAA 2006: 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures (Vol. 2006)
@inproceedings{d4a41bb02f16484e9af0c9e0ff8940d9,
title = "Brief announcement: Parallel depth first vs. work stealing schedulers on CMP architectures",
keywords = "Caches, Chip Multiprocessors, Scheduling",
author = "Vasileios Liaskovitis and Shimin Chen and Gibbons, {Phillip B.} and Anastassia Ailamaki and Blelloch, {Guy E.} and Babak Falsafi and Limor Fix and Nikos Hardavellas and Michael Kozuch and Mowry, {Todd C.} and Chris Wilkerson",
year = "2006",
month = "10",
day = "16",
language = "English (US)",
isbn = "1595934529",
volume = "2006",
booktitle = "SPAA 2006",

}

