TY - GEN
T1 - Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures
AU - Nookala, Poornima
AU - Dinda, Peter
AU - Hale, Kyle C.
AU - Chard, Kyle
AU - Raicu, Ioan
N1 - Funding Information:
This work was supported in part by the National Science Foundation (NSF) under grants 2107548/2107283, CCF-1757964, CNS-1730689, CNS-1763612, CNS-1718252, CCF-2028958, CCF-2028851, CNS-1763743 and CCF-2119069.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Enabling efficient fine-grained task parallelism is a significant challenge for hardware platforms with increasingly many cores. Existing techniques do not scale to hundreds of threads due to the high cost of synchronization in concurrent data structures. To overcome these limitations, we present XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. We demonstrate the scalability of XQueue using microbenchmarks and show that XQueue can deliver concurrent operations with latencies as low as 110 cycles at scales of up to 192 cores (up to 6900× improvement compared to traditional synchronization mechanisms) across our diverse hardware, including x86, ARM, and Power9. The reduced latency allows XQueue to provide orders of magnitude (3300×) better throughput than existing techniques. To evaluate the real-world benefits of XQueue, we integrated XQueue with LLVM OpenMP and evaluated five unmodified benchmarks from the Barcelona OpenMP Task Suite (BOTS) as well as a graph traversal benchmark from the GAP benchmark suite. We compared the XQueue-enabled LLVM OpenMP implementation with the native LLVM and GNU OpenMP versions. Using fine-grained task workloads, XQueue can deliver 4× to 6× speedup compared to native GNU OpenMP and LLVM OpenMP in many cases, with speedups as high as 116× in some cases.
AB - Enabling efficient fine-grained task parallelism is a significant challenge for hardware platforms with increasingly many cores. Existing techniques do not scale to hundreds of threads due to the high cost of synchronization in concurrent data structures. To overcome these limitations, we present XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. We demonstrate the scalability of XQueue using microbenchmarks and show that XQueue can deliver concurrent operations with latencies as low as 110 cycles at scales of up to 192 cores (up to 6900× improvement compared to traditional synchronization mechanisms) across our diverse hardware, including x86, ARM, and Power9. The reduced latency allows XQueue to provide orders of magnitude (3300×) better throughput than existing techniques. To evaluate the real-world benefits of XQueue, we integrated XQueue with LLVM OpenMP and evaluated five unmodified benchmarks from the Barcelona OpenMP Task Suite (BOTS) as well as a graph traversal benchmark from the GAP benchmark suite. We compared the XQueue-enabled LLVM OpenMP implementation with the native LLVM and GNU OpenMP versions. Using fine-grained task workloads, XQueue can deliver 4× to 6× speedup compared to native GNU OpenMP and LLVM OpenMP in many cases, with speedups as high as 116× in some cases.
KW - concurrent data structures
KW - fine-grained parallelism
KW - lock-free
KW - lock-less
KW - parallel runtime
KW - queues
KW - tasks
UR - http://www.scopus.com/inward/record.url?scp=85123180266&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123180266&partnerID=8YFLogxK
U2 - 10.1109/MASCOTS53633.2021.9614292
DO - 10.1109/MASCOTS53633.2021.9614292
M3 - Conference contribution
AN - SCOPUS:85123180266
T3 - Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
BT - Proceedings - 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021
PB - IEEE Computer Society
T2 - 29th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2021
Y2 - 3 November 2021 through 5 November 2021
ER -