Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components

Kaicheng Zhang, Akhil Guliani, Seda Ogrenci Memik, Gokhan Memik, Kazutomo Yoshii, Rajesh Sankaran, Pete Beckman

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Elevated temperatures limit the peak performance of systems because of frequent interventions by thermal throttling. Non-uniform thermal states across system nodes also cause performance variation among seemingly equivalent nodes, leading to significant degradation of overall performance. In this paper, we present a framework for creating a lightweight thermal prediction system suitable for run-time management decisions. We pursue two avenues to explore optimized lightweight thermal predictors. First, we use feature selection algorithms to improve the performance of previously designed machine learning methods. Second, we develop alternative methods based on neural networks and linear regression to perform a comprehensive comparative study of prediction methods. We show that our optimized models achieve better prediction accuracy and lower overhead than the previously proposed Gaussian process model. Specifically, we present a reduced version of the Gaussian process model, a neural network-based model, and a linear regression-based model. Using the optimization methods, we reduce the average prediction error of the Gaussian process model from 4.2 °C to 2.9 °C. We also show that the newly developed models using neural network and Lasso linear regression have average prediction errors of 2.9 °C and 3.8 °C, respectively. The prediction overheads are 0.22, 0.097, and 0.026 ms per prediction for the reduced Gaussian process, neural network, and Lasso linear regression models, respectively, compared with 0.57 ms per prediction for the previous Gaussian process model. We have implemented our proposed thermal prediction models on a two-node system configuration to help identify the optimal task placement. The task placement identified by the models reduces the average system temperature by up to 11.9 °C without any performance degradation. Furthermore, these models respectively achieve 75, 82.5, and 74.17 percent success rates in correctly identifying the task placements with better thermal response, compared with 72.5 percent for the original model. Finally, we extended our analysis to a 16-node system, where we were able to train the models and execute them in real time to guide task migration, achieving on average a 17 percent reduction in overall system cooling power.
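The figures reported in the abstract can be put side by side with a few lines of arithmetic. The following is a minimal sketch, not code from the paper: the dictionary layout and the names `BASELINE`, `MODELS`, `compare`, and `SUMMARY` are assumptions made for illustration, and the success rates are assigned to the models in the order the abstract lists them.

```python
# Figures copied from the abstract. All names and the data layout here are
# illustrative assumptions, not part of the paper.
BASELINE = {"error_c": 4.2, "overhead_ms": 0.57, "success_pct": 72.5}

MODELS = {
    "reduced Gaussian process": {"error_c": 2.9, "overhead_ms": 0.22, "success_pct": 75.0},
    "neural network": {"error_c": 2.9, "overhead_ms": 0.097, "success_pct": 82.5},
    "Lasso linear regression": {"error_c": 3.8, "overhead_ms": 0.026, "success_pct": 74.17},
}


def compare(model, baseline=BASELINE):
    """Return a model's improvement over the original Gaussian process model."""
    return {
        # How many times cheaper one prediction is than with the baseline.
        "speedup": baseline["overhead_ms"] / model["overhead_ms"],
        # Drop in average prediction error, in degrees Celsius.
        "error_reduction_c": baseline["error_c"] - model["error_c"],
        # Gain in placement success rate, in percentage points.
        "success_gain_pts": model["success_pct"] - baseline["success_pct"],
    }


SUMMARY = {name: compare(figures) for name, figures in MODELS.items()}

if __name__ == "__main__":
    for name, s in SUMMARY.items():
        print(
            f"{name}: {s['speedup']:.1f}x lower overhead, "
            f"{s['error_reduction_c']:.1f} C lower error, "
            f"{s['success_gain_pts']:+.2f} points success rate"
        )
```

Under these figures, the per-prediction overhead drops by roughly 2.6x, 5.9x, and 21.9x for the reduced Gaussian process, neural network, and Lasso models, respectively. Lasso is the cheapest to evaluate but gives up about 0.9 °C of accuracy relative to the other two.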

Original language: English (US)
Article number: 7995115
Pages (from-to): 405-419
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 29
Issue number: 2
DOI: 10.1109/TPDS.2017.2732951
State: Published - Feb 1 2018

Fingerprint

Temperature control
Learning systems
Temperature
Linear regression
Neural networks
Degradation
Cooling systems
Hot Temperature
Feature extraction

Keywords

  • Thermal modeling
  • high performance computing systems
  • many-core processors
  • operating systems

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

Zhang, Kaicheng ; Guliani, Akhil ; Memik, Seda Ogrenci ; Memik, Gokhan ; Yoshii, Kazutomo ; Sankaran, Rajesh ; Beckman, Pete. / Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components. In: IEEE Transactions on Parallel and Distributed Systems. 2018 ; Vol. 29, No. 2. pp. 405-419.
@article{5d51cdf08e9549e5a7875538566885e7,
title = "Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components",
keywords = "Thermal modeling, high performance computing systems, many-core processors, operating systems",
author = "Kaicheng Zhang and Akhil Guliani and Memik, {Seda Ogrenci} and Gokhan Memik and Kazutomo Yoshii and Rajesh Sankaran and Pete Beckman",
year = "2018",
month = "2",
day = "1",
doi = "10.1109/TPDS.2017.2732951",
language = "English (US)",
volume = "29",
pages = "405--419",
journal = "IEEE Transactions on Parallel and Distributed Systems",
issn = "1045-9219",
publisher = "IEEE Computer Society",
number = "2",

}

Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components. / Zhang, Kaicheng; Guliani, Akhil; Memik, Seda Ogrenci; Memik, Gokhan; Yoshii, Kazutomo; Sankaran, Rajesh; Beckman, Pete.

In: IEEE Transactions on Parallel and Distributed Systems, Vol. 29, No. 2, 7995115, 01.02.2018, p. 405-419.

TY - JOUR
T1 - Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components
AU - Zhang, Kaicheng
AU - Guliani, Akhil
AU - Memik, Seda Ogrenci
AU - Memik, Gokhan
AU - Yoshii, Kazutomo
AU - Sankaran, Rajesh
AU - Beckman, Pete
PY - 2018/2/1
Y1 - 2018/2/1
KW - Thermal modeling
KW - high performance computing systems
KW - many-core processors
KW - operating systems
UR - http://www.scopus.com/inward/record.url?scp=85029161627&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85029161627&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2017.2732951
DO - 10.1109/TPDS.2017.2732951
M3 - Article
VL - 29
SP - 405
EP - 419
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
SN - 1045-9219
IS - 2
M1 - 7995115
ER -