TY - GEN
T1 - Scheduling threads for constructive cache sharing on CMPs
AU - Chen, Shimin
AU - Gibbons, Phillip B.
AU - Kozuch, Michael
AU - Liaskovitis, Vasileios
AU - Ailamaki, Anastassia
AU - Blelloch, Guy E.
AU - Falsafi, Babak
AU - Fix, Limor
AU - Hardavellas, Nikos
AU - Mowry, Todd C.
AU - Wilkerson, Chris
PY - 2007
Y1 - 2007
N2 - In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3 - 1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
AB - In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3 - 1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
KW - Chip multiprocessors
KW - Constructive cache sharing
KW - Parallel depth first
KW - Scheduling algorithms
KW - Thread granularity
KW - Work stealing
KW - Working set profiling
UR - http://www.scopus.com/inward/record.url?scp=35248852476&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35248852476&partnerID=8YFLogxK
U2 - 10.1145/1248377.1248396
DO - 10.1145/1248377.1248396
M3 - Conference contribution
AN - SCOPUS:35248852476
SN - 159593667X
SN - 9781595936677
T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
SP - 105
EP - 115
BT - SPAA'07
T2 - SPAA'07: 19th Annual Symposium on Parallelism in Algorithms and Architectures
Y2 - 9 June 2007 through 11 June 2007
ER -