TY - GEN
T1 - ACCLAiM
T2 - 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
AU - Wilkins, Michael
AU - Guo, Yanfei
AU - Thakur, Rajeev
AU - Dinda, Peter
AU - Hardavellas, Nikos
N1 - Funding Information:
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357, and by the U.S. National Science Foundation via award CCF-2119069.
Funding Information:
This research used Bebop, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory, and the resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - MPI collective communication is an omnipresent communication model for high-performance computing (HPC) systems. The performance of a collective operation depends strongly on the algorithm used to implement it. MPI libraries use inaccurate heuristics to select these algorithms, causing applications to suffer unnecessary slowdowns. Machine learning (ML)-based autotuners are a promising alternative. ML autotuners can intelligently select algorithms for individual jobs, resulting in near-optimal performance. However, these approaches currently spend more time training than they save by accelerating applications, rendering them impractical. We make the case that ML-based collective algorithm selection autotuners can be made practical and accelerate production applications on large-scale supercomputers. We identify multiple impracticalities in the existing work, such as inefficient training point selection and ignoring non-power-of-two feature values. We address these issues through variance-based point selection and model testing alongside topology-aware benchmark parallelization. Our approach minimizes training time by eliminating unnecessary training points and maximizing machine utilization. We incorporate our improvements in a prototype active learning system, ACCLAiM (Advancing Collective Communication aLgorithm Autotuning using Machine Learning). We show that each of ACCLAiM's advancements significantly reduces training time compared with the best existing machine learning approach. Then we apply ACCLAiM on a leadership-class supercomputer and demonstrate the conditions where ACCLAiM can accelerate HPC applications, proving the advantage of ML autotuners in a production setting for the first time.
AB - MPI collective communication is an omnipresent communication model for high-performance computing (HPC) systems. The performance of a collective operation depends strongly on the algorithm used to implement it. MPI libraries use inaccurate heuristics to select these algorithms, causing applications to suffer unnecessary slowdowns. Machine learning (ML)-based autotuners are a promising alternative. ML autotuners can intelligently select algorithms for individual jobs, resulting in near-optimal performance. However, these approaches currently spend more time training than they save by accelerating applications, rendering them impractical. We make the case that ML-based collective algorithm selection autotuners can be made practical and accelerate production applications on large-scale supercomputers. We identify multiple impracticalities in the existing work, such as inefficient training point selection and ignoring non-power-of-two feature values. We address these issues through variance-based point selection and model testing alongside topology-aware benchmark parallelization. Our approach minimizes training time by eliminating unnecessary training points and maximizing machine utilization. We incorporate our improvements in a prototype active learning system, ACCLAiM (Advancing Collective Communication aLgorithm Autotuning using Machine Learning). We show that each of ACCLAiM's advancements significantly reduces training time compared with the best existing machine learning approach. Then we apply ACCLAiM on a leadership-class supercomputer and demonstrate the conditions where ACCLAiM can accelerate HPC applications, proving the advantage of ML autotuners in a production setting for the first time.
KW - autotuning
KW - machine learning
KW - collective communication
KW - MPI
UR - http://www.scopus.com/inward/record.url?scp=85140920196&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140920196&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER51413.2022.00030
DO - 10.1109/CLUSTER51413.2022.00030
M3 - Conference contribution
AN - SCOPUS:85140920196
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 161
EP - 171
BT - Proceedings - 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 6 September 2022 through 9 September 2022
ER -