Nested loops with regular iteration dependencies span a large class of applications ranging from string matching to linear system solvers. Wavefront parallelism is a well-known technique for processing such applications concurrently and is widely used on GPUs to exploit their massively parallel computing capabilities. Wavefront parallelism on GPUs uses global barriers between the processing of tiles to enforce data dependencies. However, such diagonal-wide synchronization causes load imbalance by forcing streaming multiprocessors (SMs) to wait for the completion of the SM with the longest computation. Moreover, diagonal processing causes loss of locality due to elements that border adjacent tiles. In this paper, we propose PeerWave, an alternative GPU wavefront parallelization technique that improves inter-SM load balance by using peer-wise synchronization between SMs and eliminating global synchronization. Our approach also increases GPU L2 cache locality through row allocation of tiles to the SMs. We further improve PeerWave performance by using flexible hyper-tiles that reduce inter-SM wait time while maximizing intra-SM utilization. We develop an analytical model for determining the optimal tile size. Finally, we present a runtime and a CUDA-based API that allow users to easily implement their applications using PeerWave. We evaluate PeerWave on the NVIDIA K40c GPU using 6 different applications and achieve speedups of up to 2X compared to the most recent hyperplane transformation based GPU.
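To make the idea of peer-wise synchronization with row allocation concrete, the following is a minimal CPU-side sketch, not the PeerWave runtime or its CUDA API. It assumes a longest-common-subsequence table as the nested-loop workload, uses one Python thread per band of rows as a stand-in for an SM, and uses one `threading.Event` per tile as a stand-in for a per-tile completion flag. The names `lcs_peerwise`, `band_height`, and `tile_width` are hypothetical and chosen only for illustration.

```python
import threading


def lcs_peerwise(a, b, band_height=2, tile_width=3):
    """Length of the longest common subsequence of a and b.

    Illustrative assumption: each worker thread owns one band of rows and
    walks its tiles left to right. Before a tile, it waits only for the
    tile directly above it (its peer), never for a whole diagonal.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]

    # First table row of each band and first table column of each tile.
    row_starts = list(range(1, n + 1, band_height))
    col_starts = list(range(1, m + 1, tile_width))

    # done[r][c] is set once tile c of band r has been fully computed.
    done = [[threading.Event() for _ in col_starts] for _ in row_starts]

    def worker(r):
        i_lo = row_starts[r]
        i_hi = min(i_lo + band_height, n + 1)
        for c, j_lo in enumerate(col_starts):
            if r > 0:
                # Peer-wise wait: only the upper neighbour's tile is needed.
                # The left tile was finished by this same worker, and the
                # upper-left tile was finished before the upper one.
                done[r - 1][c].wait()
            j_hi = min(j_lo + tile_width, m + 1)
            for i in range(i_lo, i_hi):
                for j in range(j_lo, j_hi):
                    if a[i - 1] == b[j - 1]:
                        table[i][j] = table[i - 1][j - 1] + 1
                    else:
                        table[i][j] = max(table[i - 1][j], table[i][j - 1])
            done[r][c].set()

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(len(row_starts))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table[n][m]


print(lcs_peerwise("AGGTAB", "GXTXAYB"))  # prints 4
```

In this sketch a slow tile delays only the bands below it, and only at the columns they have not yet passed, whereas a barrier after every diagonal would stall every worker. Spawning one thread per band is a simplification; a fixed pool of workers that each take several bands in increasing order would follow the same waiting rule.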