TY - GEN
T1 - Optimizing large-scale graph analysis on multithreaded, multicore platforms
AU - Cong, Guojing
AU - Makarychev, Konstantin
PY - 2012
Y1 - 2012
N2 - The erratic memory access pattern of graph algorithms makes it hard to optimize on cache-based architectures. While multithreading hides memory latency, it is unclear how hardware threads combined with caches impact the performance of typical graph workload. As modern architectures strike different balances between caching and multithreading, it remains an open question whether the benefit of optimizing locality behavior outweighs the cost. We study parallel graph algorithms on two different multi-threaded, multi-core platforms, that is, IBM Power7 and Sun Niagara2. Our experiments first demonstrate their performance advantage over prior architectures. We find nonetheless the number of hardware threads in either platform is not sufficient to fully mask memory latency. Our cache-friendly scheduling of memory accesses improves performance by up to 2.6 times on Power7 and prior cache-based architectures, yet the same technique significantly degrades performance on Niagara2. Software prefetching and manipulating the storage of the input to improve spatial locality improve performance by up to 2.1 times and 1.3 times on both platforms. Our study reveals interesting interplay between architecture and algorithm.
AB - The erratic memory access pattern of graph algorithms makes it hard to optimize on cache-based architectures. While multithreading hides memory latency, it is unclear how hardware threads combined with caches impact the performance of typical graph workload. As modern architectures strike different balances between caching and multithreading, it remains an open question whether the benefit of optimizing locality behavior outweighs the cost. We study parallel graph algorithms on two different multi-threaded, multi-core platforms, that is, IBM Power7 and Sun Niagara2. Our experiments first demonstrate their performance advantage over prior architectures. We find nonetheless the number of hardware threads in either platform is not sufficient to fully mask memory latency. Our cache-friendly scheduling of memory accesses improves performance by up to 2.6 times on Power7 and prior cache-based architectures, yet the same technique significantly degrades performance on Niagara2. Software prefetching and manipulating the storage of the input to improve spatial locality improve performance by up to 2.1 times and 1.3 times on both platforms. Our study reveals interesting interplay between architecture and algorithm.
KW - Multi-threading
KW - Parallel Graph Algorithms
KW - Software Prefetch
KW - Traversal
UR - http://www.scopus.com/inward/record.url?scp=84866879704&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866879704&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2012.46
DO - 10.1109/IPDPS.2012.46
M3 - Conference contribution
AN - SCOPUS:84866879704
SN - 9780769546759
T3 - Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
SP - 414
EP - 425
BT - Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
T2 - 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
Y2 - 21 May 2012 through 25 May 2012
ER -