TY - GEN
T1 - Paths to fast barrier synchronization on the node
AU - Hetland, Conor
AU - Tziantzioulis, Georgios
AU - Suchy, Brian
AU - Leonard, Michael
AU - Han, Jin
AU - Albers, John
AU - Hardavellas, Nikos
AU - Dinda, Peter
PY - 2019/6/17
Y1 - 2019/6/17
N2 - Synchronization primitives like barriers heavily impact the performance of parallel programs. As core counts increase and granularity decreases, the value of enabling fast barriers increases. Through the evaluation of the performance of a variety of software implementations of barriers, we found the cost of software barriers to be on the order of tens of thousands of cycles on various incarnations of x64 hardware. We argue that reducing the latency of a barrier via hardware support will dramatically improve the performance of existing applications and runtimes, and would enable new execution models, including those which currently do not perform well on multicore machines. To support our argument, we first present the design, implementation, and evaluation of a barrier on the Intel HARP, a prototype that integrates an x64 processor and FPGA in the same package. This effort gives insight into the potential speed and compactness of hardware barriers, and suggests useful improvements to the HARP platform. Next, we turn to the processor itself and describe an x64 ISA extension for barriers, and how it could be implemented in the microarchitecture with minimal collateral changes. This design allows for barriers to be securely managed jointly between the OS and the application. Finally, we speculate on how barrier synchronization might be implemented on future photonics-based hardware.
KW - Collective communication
KW - HPC
KW - Parallel computing
KW - Synchronization
UR - http://www.scopus.com/inward/record.url?scp=85069156763&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85069156763&partnerID=8YFLogxK
U2 - 10.1145/3307681.3325402
DO - 10.1145/3307681.3325402
M3 - Conference contribution
T3 - HPDC 2019 - Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
SP - 109
EP - 120
BT - HPDC 2019 - Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 28th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2019
Y2 - 22 June 2019 through 29 June 2019
ER -