TY - GEN
T1 - Towards performance and scalability analysis of distributed memory programs on large-scale clusters
AU - Medya, Sourav
AU - Cherkasova, Ludmila
AU - Magalhaes, Guilherme
AU - Ozonat, Kivanc
AU - Padmanabha, Chaitra
AU - Sarma, Jiban
AU - Sheikh, Imran
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/3/12
Y1 - 2016/3/12
N2 - Many HPC and modern Big Data processing applications belong to a class of so-called scale-out applications, where the application dataset is partitioned and processed by a cluster of machines. Understanding and assessing the scalability of the designed application is one of the primary goals during application implementation. Typically, in the design and implementation phase, the programmer is bound to a limited-size cluster for debugging and profiling experiments. The challenge is to assess the scalability of the designed program for execution on a larger cluster. While in a larger cluster each node needs to process a smaller fraction of the original dataset, the communication volume and communication time might increase significantly, which could become detrimental and yield diminishing performance benefits. Distributed memory applications exhibit complex behavior: they tend to interleave computations and communications, use bursty transfers, and utilize global synchronization primitives. Therefore, one of the main challenges is the analysis of bandwidth demands due to increased communication volume as a function of the cluster size. In this paper, we introduce a novel approach to assess the scalability and performance of a distributed memory program for execution on a large-scale cluster. Our solution involves 1) a limited set of traditional experiments performed on a medium-size cluster and 2) an additional set of similar experiments performed with an "interconnect bandwidth throttling" tool, which enables the assessment of the communication demands with respect to available bandwidth. This approach enables prediction of the cluster size at which communication cost becomes the dominant component and the performance benefits of further increasing the cluster size diminish. We demonstrate the proposed approach using the popular Graph500 benchmark.
AB - Many HPC and modern Big Data processing applications belong to a class of so-called scale-out applications, where the application dataset is partitioned and processed by a cluster of machines. Understanding and assessing the scalability of the designed application is one of the primary goals during application implementation. Typically, in the design and implementation phase, the programmer is bound to a limited-size cluster for debugging and profiling experiments. The challenge is to assess the scalability of the designed program for execution on a larger cluster. While in a larger cluster each node needs to process a smaller fraction of the original dataset, the communication volume and communication time might increase significantly, which could become detrimental and yield diminishing performance benefits. Distributed memory applications exhibit complex behavior: they tend to interleave computations and communications, use bursty transfers, and utilize global synchronization primitives. Therefore, one of the main challenges is the analysis of bandwidth demands due to increased communication volume as a function of the cluster size. In this paper, we introduce a novel approach to assess the scalability and performance of a distributed memory program for execution on a large-scale cluster. Our solution involves 1) a limited set of traditional experiments performed on a medium-size cluster and 2) an additional set of similar experiments performed with an "interconnect bandwidth throttling" tool, which enables the assessment of the communication demands with respect to available bandwidth. This approach enables prediction of the cluster size at which communication cost becomes the dominant component and the performance benefits of further increasing the cluster size diminish. We demonstrate the proposed approach using the popular Graph500 benchmark.
UR - http://www.scopus.com/inward/record.url?scp=85019445725&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85019445725&partnerID=8YFLogxK
U2 - 10.1145/2851553.2858669
DO - 10.1145/2851553.2858669
M3 - Conference contribution
AN - SCOPUS:85019445725
T3 - ICPE 2016 - Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering
SP - 113
EP - 116
BT - ICPE 2016 - Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering
PB - Association for Computing Machinery, Inc
T2 - 7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016
Y2 - 12 March 2016 through 16 March 2016
ER -