ACCLAiM: Advancing the Practicality of MPI Collective Communication Autotuning Using Machine Learning

Michael Wilkins, Yanfei Guo, Rajeev Thakur, Peter Dinda, Nikos Hardavellas

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

MPI collective communication is an omnipresent communication model for high-performance computing (HPC) systems. The performance of a collective operation depends strongly on the algorithm used to implement it. MPI libraries use inaccurate heuristics to select these algorithms, causing applications to suffer unnecessary slowdowns. Machine learning (ML)-based autotuners are a promising alternative. ML autotuners can intelligently select algorithms for individual jobs, resulting in near-optimal performance. However, these approaches currently spend more time training than they save by accelerating applications, rendering them impractical. We make the case that ML-based collective algorithm selection autotuners can be made practical and accelerate production applications on large-scale supercomputers. We identify multiple impracticalities in the existing work, such as inefficient training point selection and disregard for non-power-of-two feature values. We address these issues through variance-based point selection and model testing, alongside topology-aware benchmark parallelization. Our approach minimizes training time by eliminating unnecessary training points and maximizing machine utilization. We incorporate our improvements into a prototype active learning system, ACCLAiM (Advancing Collective Communication (L) Autotuning using Machine Learning). We show that each of ACCLAiM's advancements significantly reduces training time compared with the best existing machine learning approach. Then we apply ACCLAiM on a leadership-class supercomputer and demonstrate the conditions under which ACCLAiM can accelerate HPC applications, proving the advantage of ML autotuners in a production setting for the first time.
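At the heart of the approach described in the abstract is an active learning loop: fit a surrogate model to measured collective latencies, then repeatedly benchmark the candidate configuration about which the model is most uncertain, so no benchmark time is spent on points the model already predicts well. The Python sketch below (not the paper's implementation) illustrates variance-based point selection, using disagreement across a random-forest ensemble as the uncertainty estimate; the feature ranges and the benchmark() stand-in are hypothetical assumptions.

    # A minimal sketch of variance-based training point selection (active
    # learning). The surrogate, feature ranges, and benchmark() are
    # illustrative assumptions, not ACCLAiM's actual components.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    # Hypothetical feature space: (log2 message size, log2 node count,
    # algorithm id).
    candidate_pool = np.array([(m, n, a)
                               for m in range(21)   # 1 B .. 1 MiB messages
                               for n in range(10)   # 1 .. 512 nodes
                               for a in range(4)],  # 4 candidate algorithms
                              dtype=float)

    def benchmark(x):
        """Stand-in for timing a collective; returns a fake latency."""
        m, n, a = x
        return (2.0 ** m) * (n + 1.0) * (1.0 + 0.1 * a) + rng.normal(0.0, 1.0)

    # Seed the surrogate with a few random measurements.
    idx = rng.choice(len(candidate_pool), size=16, replace=False)
    X = candidate_pool[idx]
    y = np.array([benchmark(x) for x in X])

    for _ in range(20):  # a fixed budget stands in for model testing
        model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
        # Disagreement among the trees approximates predictive variance.
        per_tree = np.stack([t.predict(candidate_pool)
                             for t in model.estimators_])
        pick = candidate_pool[int(per_tree.var(axis=0).argmax())]
        X = np.vstack([X, pick])           # measure the most uncertain point
        y = np.append(y, benchmark(pick))

In ACCLAiM itself, the loop instead stops once model testing shows the surrogate is accurate enough, and the benchmarks for each step run in parallel across the allocated nodes in a topology-aware fashion.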

Original language: English (US)
Title of host publication: Proceedings - 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 161-171
Number of pages: 11
ISBN (Electronic): 9781665498562
DOIs
State: Published - 2022
Event: 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022 - Heidelberg, Germany
Duration: Sep 6 2022 - Sep 9 2022

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2022-September
ISSN (Print): 1552-5244

Conference

Conference: 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
Country/Territory: Germany
City: Heidelberg
Period: 9/6/22 - 9/9/22

Funding

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration; by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357; and by the U.S. National Science Foundation via award CCF-2119069. This research used Bebop, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory, and the resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Keywords

  • MPI
  • autotuning
  • machine learning
  • collective communication

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
