Minimizing Thermal Variation Across System Components

Kaicheng Zhang, Seda Ogrenci-Memik, Gokhan Memik, Kazutomo Yoshii, Rajesh Sankaran, Pete Beckman

Research output: Chapter in Book/Report/Conference proceedingConference contribution

19 Scopus citations

Abstract

Thermal overheating is a serious concern in modern supercomputing systems. Elevated temperature levels reduce the reliability and the lifetime of the underlying hardware and increase their power consumption. Previous studies on mitigating thermal hotspots at the hardware and run-time system levels have typically used approaches that trade off performance for reduced operating temperatures. In this paper, we first show that in a large-scale system, physical attributes cause an uneven temperature distribution. We then develop a model to characterize the thermal behaviour of a complex system using various machine learning methods. We propose to improve application placement by incorporating thermal awareness into the decision-making process. Specifically, our system predicts the thermal condition of the system based on application mapping and uses these predictions to mitigate thermal hotspots without any performance loss. We provide two versions of our prediction mechanism. On a two-node configuration, these models achieve 72.5% and 78.8% success rates in their predictions, respectively. In other words, the scheduling decisions of our models result in a task placement that has a lower maximum average temperature. Overall, the more aggressive scheme reduces the average peak temperature by up to 11.9°C (2.3°C on average) without any performance degradation.

Original languageEnglish (US)
Title of host publicationProceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1139-1148
Number of pages10
ISBN (Electronic)9781479986484
DOIs
StatePublished - Jul 17 2015
Event29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015 - Hyderabad, India
Duration: May 25 2015May 29 2015

Publication series

NameProceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015

Other

Other29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015
Country/TerritoryIndia
CityHyderabad
Period5/25/155/29/15

Keywords

  • Task scheduling
  • thermal model

ASJC Scopus subject areas

  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'Minimizing Thermal Variation Across System Components'. Together they form a unique fingerprint.

Cite this