Abstract
According to recent performance analyses, MPI collective operations account for a quarter of the execution time on production systems. Machine learning (ML) autotuners use supervised learning to select collective algorithms, significantly improving collective performance. However, we observe two barriers preventing their adoption over the default heuristic-based autotuners. First, a user may find it difficult to compare autotuners because we lack a methodology to quantify their performance. We call this the performance quantification challenge. Second, to obtain the advertised performance, ML model training requires benchmark data covering the vast majority of the feature space. Collecting such data regularly on large-scale systems consumes far too much time and resources, and this will only get worse with exascale systems. We refer to this as the training data collection challenge. To address these challenges, we contribute (1) a performance evaluation framework to compare and improve collective autotuner designs and (2) the Feature scaling, Active learning, Converge, Tune hyperparameters (FACT) approach, a three-part methodology to minimize the training data collection time (and thus maximize practicality at larger scale) without sacrificing accuracy. In the methodology, we first preprocess feature and output values based on domain knowledge. Then, we use active learning to iteratively collect only the necessary training data points. Lastly, we perform hyperparameter tuning to further improve model accuracy without any additional data. On a production-scale system, our methodology produces a model of equal accuracy using 6.88x less training data collection time.
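To make the three steps concrete, below is a minimal, hypothetical sketch in Python with scikit-learn. The features (message size and process count), the synthetic `benchmark_best_algorithm` labeling function, the random-forest model, and the least-confidence sampling strategy are illustrative assumptions, not the implementation evaluated in the paper; the sketch only shows the shape of the workflow: log-scale the features, benchmark only the points the model is least certain about, then tune hyperparameters on the data already collected.

```python
# Hypothetical FACT-style sketch (assumed features, model, and labels; not the paper's code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)

# Candidate configurations (the feature space): message size and process count.
msg_sizes = 2.0 ** np.arange(4, 24)       # 16 B .. 8 MiB
proc_counts = 2.0 ** np.arange(1, 11)     # 2 .. 1024 ranks
pool = np.array([(m, p) for m in msg_sizes for p in proc_counts])

# Step 1: feature scaling based on domain knowledge -- both features span
# orders of magnitude, so work in log space.
X_pool = np.log2(pool)

def benchmark_best_algorithm(x):
    """Stand-in for benchmarking every collective algorithm at configuration x
    and returning the index of the fastest one (synthetic rule here)."""
    log_msg, log_procs = x
    return int(log_msg > 14) + int(log_procs > 6)

# Step 2: active learning -- seed with a few random points, then benchmark
# only the configurations the current model is least confident about.
labeled = [int(i) for i in rng.choice(len(X_pool), size=25, replace=False)]
y = {i: benchmark_best_algorithm(X_pool[i]) for i in labeled}
model = RandomForestClassifier(n_estimators=100, random_state=0)

for _ in range(15):                        # loop until convergence or budget
    model.fit(X_pool[labeled], [y[i] for i in labeled])
    unlabeled = [i for i in range(len(X_pool)) if i not in y]
    confidence = model.predict_proba(X_pool[unlabeled]).max(axis=1)
    pick = unlabeled[int(np.argmin(confidence))]   # least-confident point
    y[pick] = benchmark_best_algorithm(X_pool[pick])
    labeled.append(pick)

# Step 3: hyperparameter tuning reuses the collected data -- no new benchmarks.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100, 200], "max_depth": [None, 8, 16]},
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_pool[labeled], [y[i] for i in labeled])
print("best parameters:", search.best_params_)
```

In a real autotuner the labeling step would time each collective algorithm on the target system, so every avoided query in step 2 directly reduces benchmark time, which is the saving the paper quantifies.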
| Original language | English (US) |
| --- | --- |
| Title of host publication | Proceedings of ExaMPI 2021 |
| Subtitle of host publication | Workshop on Exascale MPI, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 36-45 |
| Number of pages | 10 |
| ISBN (Electronic) | 9781665411080 |
| DOIs | |
| State | Published - 2021 |
| Event | 2021 Workshop on Exascale MPI, ExaMPI 2021 - St. Louis, United States. Duration: Nov 14 2021 → … |
Publication series
| Name | Proceedings of ExaMPI 2021: Workshop on Exascale MPI, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis |
| --- | --- |
Conference
| Conference | 2021 Workshop on Exascale MPI, ExaMPI 2021 |
| --- | --- |
| Country/Territory | United States |
| City | St. Louis |
| Period | 11/14/21 → … |
Funding
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357.
Keywords
- MPI
- collective communication
- machine learning
ASJC Scopus subject areas
- Hardware and Architecture
- Software
- Artificial Intelligence
- Computer Networks and Communications