A FACT-based Approach: Making Machine Learning Collective Autotuning Feasible on Exascale Systems

Michael Wilkins, Yanfei Guo, Rajeev Thakur, Nikos Hardavellas, Peter Dinda, Min Si

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

According to recent performance analyses, MPI collective operations make up a quarter of the execution time on production systems. Machine learning (ML) autotuners use supervised learning to select collective algorithms, significantly improving collective performance. However, we observe two barriers preventing their adoption over the default heuristic-based autotuners. First, a user may find it difficult to compare autotuners because we lack a methodology to quantify their performance. We call this the performance quantification challenge. Second, to obtain the advertised performance, ML model training requires benchmark data from the vast majority of the feature space. Collecting such data regularly on large-scale systems consumes far too much time and resources, and this will only get worse with exascale systems. We refer to this as the training data collection challenge. To address these challenges, we contribute (1) a performance evaluation framework to compare and improve collective autotuner designs and (2) the Feature scaling, Active learning, Converge, Tune hyperparameters (FACT) approach, a three-part methodology to minimize the training data collection time (and thus maximize practicality at larger scale) without sacrificing accuracy. In the methodology, we first preprocess feature and output values based on domain knowledge. Then, we use active learning to iteratively collect only the necessary training data points. Lastly, we perform hyperparameter tuning to further improve model accuracy without any additional data. On a production-scale system, our methodology produces a model of equal accuracy using 6.88x less training data collection time.
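As a rough illustration of the three FACT steps named in the abstract, here is a minimal Python sketch. The paper does not publish this code; the feature choice (message size, communicator size), the log2 scaling, the random-forest model, the least-confidence query strategy, and the synthetic benchmark stub are all assumptions made for the sake of a self-contained example, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def benchmark_best_algorithm(point):
    """Synthetic stand-in for a real MPI benchmark run: returns the index of
    the hypothetically fastest collective algorithm at (msg_size, comm_size).
    In practice this is the expensive step FACT tries to minimize."""
    msg_size, comm_size = point
    if msg_size < 1024:
        return 0
    if msg_size < 65536 and comm_size <= 256:
        return 1
    return 2

# Step 1 -- Feature scaling: message and communicator sizes span many orders
# of magnitude, so log-scale them before training (domain-knowledge preprocessing).
def preprocess(points):
    return np.log2(np.asarray(points, dtype=float))

# Candidate pool: the feature space we could, but do not want to, fully benchmark.
pool = [(2 ** m, 2 ** p) for m in range(25) for p in range(1, 13)]
X_pool = preprocess(pool)

# Seed the model with a few randomly benchmarked points.
rng = np.random.default_rng(0)
queried = set(int(i) for i in rng.choice(len(pool), size=16, replace=False))
X_train = X_pool[sorted(queried)]
y_train = np.array([benchmark_best_algorithm(pool[i]) for i in sorted(queried)])

# Step 2 -- Active learning: repeatedly benchmark only the point the current
# model is least confident about (the convergence test is elided for brevity).
model = RandomForestClassifier(n_estimators=100, random_state=0)
for _ in range(50):
    model.fit(X_train, y_train)
    uncertainty = 1.0 - model.predict_proba(X_pool).max(axis=1)
    uncertainty[sorted(queried)] = -np.inf   # never re-benchmark a point
    query = int(np.argmax(uncertainty))
    queried.add(query)
    X_train = np.vstack([X_train, X_pool[query]])
    y_train = np.append(y_train, benchmark_best_algorithm(pool[query]))

# Step 3 -- Hyperparameter tuning on the data already collected: improves
# accuracy without running any additional benchmarks.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100, 200], "max_depth": [None, 8, 16]},
    cv=3,
)
search.fit(X_train, y_train)
print("tuned model:", search.best_params_)
```

In the real setting the oracle would be a timed run of each collective algorithm on the target machine, and the loop would stop once model accuracy converges (the "Converge" step); querying only uncertain points rather than sweeping the whole space is what the abstract credits for the reduction in training data collection time.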

Original language: English (US)
Title of host publication: Proceedings of ExaMPI 2021
Subtitle of host publication: Workshop on Exascale MPI, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 36-45
Number of pages: 10
ISBN (Electronic): 9781665411080
DOIs
State: Published - 2021
Event: 2021 Workshop on Exascale MPI, ExaMPI 2021 - St. Louis, United States
Duration: Nov 14 2021 → …

Publication series

Name: Proceedings of ExaMPI 2021: Workshop on Exascale MPI, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2021 Workshop on Exascale MPI, ExaMPI 2021
Country/Territory: United States
City: St. Louis
Period: 11/14/21 → …

Keywords

  • MPI
  • collective communication
  • machine learning

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • Artificial Intelligence
  • Computer Networks and Communications
