Minimizing Thermal Variation in Heterogeneous HPC Systems with FPGA Nodes

Yingyi Luo, Xiaoyang Wang, Seda Ogrenci Memik, Gokhan Memik, Kazutomo Yoshii, Pete Beckman

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

The presence of FPGAs in data centers has been growing due to their superior performance as accelerators. Thermal management, particularly controlling the cooling cost in these high-performance systems, is a primary concern. Introducing new heterogeneous components adds further complexity to thermal modeling and management. The thermal behavior of multi-FPGA systems deployed within large compute clusters has been little explored. In this paper, we first show that the thermal behavior of different FPGAs of the same generation can vary due to their physical locations in a rack and to process variation, even when they are running the same tasks. We present a machine-learning-based model to capture the thermal behavior of a multi-node FPGA cluster. We then propose to mitigate thermal variation and hotspots across the cluster by proactive task placement guided by our thermal model. Our experiments show that through proper placement of tasks on the multi-FPGA system, we can reduce the peak temperature by up to 11.50°C with no impact on performance.

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 537-544
Number of pages: 8
ISBN (Electronic): 9781538684771
DOIs: https://doi.org/10.1109/ICCD.2018.00086
State: Published - Jan 16 2019
Event: 36th International Conference on Computer Design, ICCD 2018 - Orlando, United States
Duration: Oct 7 2018 to Oct 10 2018

Publication series

Name: Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018

Conference

Conference: 36th International Conference on Computer Design, ICCD 2018
Country: United States
City: Orlando
Period: 10/7/18 to 10/10/18

Fingerprint

Field programmable gate arrays (FPGA)
Particle accelerators
Learning systems
Hot Temperature
Cooling
Costs
Experiments
Temperature

Keywords

  • HPC
  • Task Placement
  • Thermal Modeling

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Safety, Risk, Reliability and Quality

Cite this

Luo, Y., Wang, X., Memik, S. O., Memik, G., Yoshii, K., & Beckman, P. (2019). Minimizing Thermal Variation in Heterogeneous HPC Systems with FPGA Nodes. In Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018 (pp. 537-544). [8615736] (Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCD.2018.00086
@inproceedings{3db4d63d418d468f841c013c7876f263,
title = "Minimizing Thermal Variation in Heterogeneous HPC Systems with FPGA Nodes",
abstract = "The presence of FPGAs in data centers has been growing due to their superior performance as accelerators. Thermal management, particularly controlling the cooling cost in these high-performance systems, is a primary concern. Introducing new heterogeneous components adds further complexity to thermal modeling and management. The thermal behavior of multi-FPGA systems deployed within large compute clusters has been little explored. In this paper, we first show that the thermal behavior of different FPGAs of the same generation can vary due to their physical locations in a rack and to process variation, even when they are running the same tasks. We present a machine-learning-based model to capture the thermal behavior of a multi-node FPGA cluster. We then propose to mitigate thermal variation and hotspots across the cluster by proactive task placement guided by our thermal model. Our experiments show that through proper placement of tasks on the multi-FPGA system, we can reduce the peak temperature by up to 11.50°C with no impact on performance.",
keywords = "HPC, Task Placement, Thermal Modeling",
author = "Yingyi Luo and Xiaoyang Wang and Memik, {Seda Ogrenci} and Gokhan Memik and Kazutomo Yoshii and Pete Beckman",
year = "2019",
month = "1",
day = "16",
doi = "10.1109/ICCD.2018.00086",
language = "English (US)",
series = "Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "537--544",
booktitle = "Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018",
address = "United States",

}

TY - GEN
T1 - Minimizing Thermal Variation in Heterogeneous HPC Systems with FPGA Nodes
AU - Luo, Yingyi
AU - Wang, Xiaoyang
AU - Memik, Seda Ogrenci
AU - Memik, Gokhan
AU - Yoshii, Kazutomo
AU - Beckman, Pete
PY - 2019/1/16
Y1 - 2019/1/16
AB - The presence of FPGAs in data centers has been growing due to their superior performance as accelerators. Thermal management, particularly controlling the cooling cost in these high-performance systems, is a primary concern. Introducing new heterogeneous components adds further complexity to thermal modeling and management. The thermal behavior of multi-FPGA systems deployed within large compute clusters has been little explored. In this paper, we first show that the thermal behavior of different FPGAs of the same generation can vary due to their physical locations in a rack and to process variation, even when they are running the same tasks. We present a machine-learning-based model to capture the thermal behavior of a multi-node FPGA cluster. We then propose to mitigate thermal variation and hotspots across the cluster by proactive task placement guided by our thermal model. Our experiments show that through proper placement of tasks on the multi-FPGA system, we can reduce the peak temperature by up to 11.50°C with no impact on performance.
KW - HPC
KW - Task Placement
KW - Thermal Modeling
UR - http://www.scopus.com/inward/record.url?scp=85062238722&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062238722&partnerID=8YFLogxK
U2 - 10.1109/ICCD.2018.00086
DO - 10.1109/ICCD.2018.00086
M3 - Conference contribution
T3 - Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018
SP - 537
EP - 544
BT - Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018
PB - Institute of Electrical and Electronics Engineers Inc.
ER -
