TY - GEN
T1 - PeerWave
T2 - 29th ACM International Conference on Supercomputing, ICS 2015
AU - Belviranli, Mehmet E.
AU - Deng, Peng
AU - Bhuyan, Laxmi N.
AU - Gupta, Rajiv
AU - Zhu, Qi
N1 - Publisher Copyright: © Copyright 2015 ACM.
PY - 2015/6/8
Y1 - 2015/6/8
N2 - Nested loops with regular iteration dependencies span a large class of applications ranging from string matching to linear system solvers. Wavefront parallelism is a well-known technique to enable concurrent processing of such applications and is widely used on GPUs to benefit from their massively parallel computing capabilities. Wavefront parallelism on GPUs uses global barriers between processing of tiles to enforce data dependencies. However, such diagonal-wide synchronization causes load imbalance by forcing SMs to wait for the completion of the SM with the longest computation. Moreover, diagonal processing causes loss of locality due to elements that border adjacent tiles. In this paper, we propose PeerWave, an alternative GPU wavefront parallelization technique that improves inter-SM load balance by using peer-wise synchronization between SMs and eliminating global synchronization. Our approach also increases GPU L2 cache locality through row allocation of tiles to the SMs. We further improve PeerWave performance by using flexible hyper-tiles that reduce inter-SM wait time while maximizing intra-SM utilization. We develop an analytical model for determining the optimal tile size. Finally, we present a runtime and a CUDA-based API to allow users to easily implement their applications using PeerWave. We evaluate PeerWave on the NVIDIA K40c GPU using 6 different applications and achieve speedups of up to 2X compared to the most recent hyperplane transformation based GPU implementation.
AB - Nested loops with regular iteration dependencies span a large class of applications ranging from string matching to linear system solvers. Wavefront parallelism is a well-known technique to enable concurrent processing of such applications and is widely used on GPUs to benefit from their massively parallel computing capabilities. Wavefront parallelism on GPUs uses global barriers between processing of tiles to enforce data dependencies. However, such diagonal-wide synchronization causes load imbalance by forcing SMs to wait for the completion of the SM with the longest computation. Moreover, diagonal processing causes loss of locality due to elements that border adjacent tiles. In this paper, we propose PeerWave, an alternative GPU wavefront parallelization technique that improves inter-SM load balance by using peer-wise synchronization between SMs and eliminating global synchronization. Our approach also increases GPU L2 cache locality through row allocation of tiles to the SMs. We further improve PeerWave performance by using flexible hyper-tiles that reduce inter-SM wait time while maximizing intra-SM utilization. We develop an analytical model for determining the optimal tile size. Finally, we present a runtime and a CUDA-based API to allow users to easily implement their applications using PeerWave. We evaluate PeerWave on the NVIDIA K40c GPU using 6 different applications and achieve speedups of up to 2X compared to the most recent hyperplane transformation based GPU implementation.
KW - Decentralized synchronization
KW - GP-GPU computing
KW - Wavefront parallelism
UR - http://www.scopus.com/inward/record.url?scp=84957543294&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84957543294&partnerID=8YFLogxK
U2 - 10.1145/2751205.2751243
DO - 10.1145/2751205.2751243
M3 - Conference contribution
AN - SCOPUS:84957543294
T3 - Proceedings of the International Conference on Supercomputing
SP - 25
EP - 35
BT - ICS 2015 - Proceedings of the 29th ACM International Conference on Supercomputing
PB - Association for Computing Machinery
Y2 - 8 June 2015 through 11 June 2015
ER -