TY - JOUR
T1 - Thermal Management for FPGA Nodes in HPC Systems
AU - Luo, Yingyi
AU - Zhao, Joshua C.
AU - Aggarwal, Arnav
AU - Ogrenci-Memik, Seda
AU - Yoshii, Kazutomo
N1 - Funding Information:
This work was partly done when Arnav was an intern at Northwestern University. Results presented in this article were obtained using the Chameleon testbed supported by the National Science Foundation. This material was based upon work supported in part by the U.S. Department of Energy Office of Science, under contract DE-AC02-06CH11357. Authors' addresses: Y. Luo, J. C. Zhao, and S. Ogrenci-Memik, Northwestern University, 2145 Sheridan Road, Evanston, Illinois, 60208; emails: yingyi.luo@eecs.northwestern.edu, joshuazhao2021@u.northwestern.edu, seda@eecs.northwestern.edu; A. Aggarwal, William Fremd High School, 1000 S Quentin Road, Palatine, Illinois, 60067; email: arnavaggarwal093@gmail.com; K. Yoshii, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, Illinois, 60439; email: kazutomo@mcs.anl.gov.
Publisher Copyright:
© 2020 ACM.
PY - 2020/10
Y1 - 2020/10
N2 - The integration of FPGAs into large-scale computing systems is gaining attention. In these systems, real-time data handling for networking, scientific computing tasks, and machine learning can be executed with customized datapaths on the reconfigurable fabric within heterogeneous compute nodes. At the same time, thermal management, particularly controlling cooling costs and guaranteeing reliability, remains a continuing concern. The introduction of new heterogeneous components into HPC nodes only adds further complexity to thermal modeling and management. The thermal behavior of multi-FPGA systems deployed within large compute clusters is less explored. In this article, we first show that the thermal behaviors of different FPGAs of the same generation can vary due to their physical locations in a rack and process variation, even when they are running the same tasks. We present a machine learning-based model to capture the thermal behavior of each individual FPGA in the cluster. We then propose two thermal management strategies guided by our thermal model. First, we mitigate thermal variation and hotspots across the cluster through proactive thermal-aware task placement. Under the tested system and benchmarks, we achieve up to 26.4°C and on average 13.3°C system temperature reduction with no performance penalty. Second, we use this thermal model to guide HLS parameter tuning at the task design stage to achieve an improved thermal response after deployment.
AB - The integration of FPGAs into large-scale computing systems is gaining attention. In these systems, real-time data handling for networking, scientific computing tasks, and machine learning can be executed with customized datapaths on the reconfigurable fabric within heterogeneous compute nodes. At the same time, thermal management, particularly controlling cooling costs and guaranteeing reliability, remains a continuing concern. The introduction of new heterogeneous components into HPC nodes only adds further complexity to thermal modeling and management. The thermal behavior of multi-FPGA systems deployed within large compute clusters is less explored. In this article, we first show that the thermal behaviors of different FPGAs of the same generation can vary due to their physical locations in a rack and process variation, even when they are running the same tasks. We present a machine learning-based model to capture the thermal behavior of each individual FPGA in the cluster. We then propose two thermal management strategies guided by our thermal model. First, we mitigate thermal variation and hotspots across the cluster through proactive thermal-aware task placement. Under the tested system and benchmarks, we achieve up to 26.4°C and on average 13.3°C system temperature reduction with no performance penalty. Second, we use this thermal model to guide HLS parameter tuning at the task design stage to achieve an improved thermal response after deployment.
KW - Thermal modeling
KW - high performance computing
KW - task placement
KW - thermal-aware design
UR - http://www.scopus.com/inward/record.url?scp=85097578943&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097578943&partnerID=8YFLogxK
U2 - 10.1145/3423494
DO - 10.1145/3423494
M3 - Article
AN - SCOPUS:85097578943
SN - 1084-4309
VL - 26
JO - ACM Transactions on Design Automation of Electronic Systems
JF - ACM Transactions on Design Automation of Electronic Systems
IS - 2
M1 - 3423494
ER -