COOLR: A New System for Dynamic Thermal-Aware Computing

Project: Research project

Project Details

Description

COOLR: A New System for Dynamic Thermal-Aware Computing

Lead Institution: Argonne National Laboratory
PI: Pete Beckman, [email protected], (630) 252-9020
Co-PIs and Senior Personnel:
Northwestern University: Seda Ogrenci-Memik, Gokhan Memik
Argonne National Laboratory: Kazutomo Yoshii

Abstract

Extreme-scale systems need to walk a fine line between the amount of cooling they receive and the thermal-induced performance and reliability degradation they can sustain. System managers are extremely motivated to battle cooling-energy cost: the largest line item in their total operating cost. On the other hand, pushing the nodes to peak performance in tightened cooling regimes places the burden on the dynamic thermal management (DTM) to protect the hardware. DTM schemes throttle performance to relieve heat accumulation within nodes when cooling cannot mitigate the problem. These interventions prevent fatal failures but introduce inevitable performance degradations and variations. Such variations can have drastic consequences on the performance of extreme-scale systems. Furthermore, distribution of heat across different system nodes depends not only on the amount of workload. Even if a strictly equal share of computational load is assigned to all nodes, there are physical attributes and topological features of each system that inherently cause uneven accumulation of heat. These can cause one subcomponent to trigger DTM prematurely and penalize the overall system.

Our goal in this project is to create a holistic thermal-aware view of the system, capturing both inherent physical attributes and dynamic system state. We propose to develop COOLR, a dynamic system with the ability to evaluate the interplay between management of computation, data, power dissipation, and thermal state. Thereby, COOLR will achieve higher overall performance at the same energy and cooling cost. Specific objectives of this proposal can be categorized into two main directions.
First, we will perform systematic power and thermal modeling of high-performance computing architectures. Novel thermal instrumentation techniques will be investigated as part of this effort. Also, analytical and empirical techniques will be blended to generate a lightweight thermal model of target systems. Second, we will develop a thermal-aware OS and runtime system. Policies for managing resources and schedules for processes and the memory hierarchy will be designed to co-optimize the thermal state and performance of the system. The main outcome of our proposed thermal-aware dynamic computing system will be scheduling and allocation of processes and memory accesses leading to an appropriately “skewed” distribution of activity across the system. The ultimate result of this new arrangement will be fewer hotspots and fewer episodes of DTM intervention, leading to improvements in performance. The overall thermal characterization and modeling methodology (thermal instrumentation techniques, model building, and systematic model reduction) resulting from this project can be applied to systems beyond our immediate scope and will be applicable to future scaling. The thermal-aware dynamic computing paradigm will impact the efficiency of extreme-scale systems. The gained “thermal slack” can be given back to further tighten the cooling budget, benefiting the management cost of next-generation extreme-scale systems.
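
The idea of a "skewed" distribution of activity can be illustrated with a toy calculation. The sketch below is not the COOLR model or scheduler; it assumes a hypothetical lumped steady-state model in which each node's temperature is the ambient temperature plus a per-node thermal resistance times its power draw, and power is proportional to load. The function names, the resistance values, and the `watts_per_unit` and `ambient` parameters are all illustrative assumptions.

```python
# Toy illustration only: a lumped steady-state thermal model (an assumption,
# not the model developed in the project). T_i = ambient + R_i * P_i, where
# R_i is a per-node thermal resistance (K/W) and P_i is proportional to load.

def steady_state_temps(loads, resistances, watts_per_unit=100.0, ambient=25.0):
    """Return the steady-state temperature of each node in degrees C."""
    return [ambient + r * watts_per_unit * load
            for load, r in zip(loads, resistances)]


def skewed_loads(total_load, resistances):
    """Split total_load so every node reaches the same steady-state temperature.

    Under the linear model above, equal temperatures require load_i to be
    proportional to 1 / R_i, so nodes that shed heat poorly get less work.
    """
    inverse = [1.0 / r for r in resistances]
    scale = total_load / sum(inverse)
    return [scale * inv for inv in inverse]


# Hypothetical nodes: the third sheds heat poorly (for example, it sits at
# the end of the airflow path).
resistances = [0.20, 0.25, 0.40]
total_load = 3.0

equal = [total_load / len(resistances)] * len(resistances)
skewed = skewed_loads(total_load, resistances)

equal_temps = steady_state_temps(equal, resistances)
skewed_temps = steady_state_temps(skewed, resistances)

print("equal split, hottest node: ", round(max(equal_temps), 2))
print("skewed split, hottest node:", round(max(skewed_temps), 2))
```

In this toy model the same total load produces a lower peak temperature under the skewed split, which is the kind of "thermal slack" the abstract refers to. A real system would also need transient behavior, coupling between neighboring nodes, and measured rather than assumed parameters.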
Status: Finished
Effective start/end date: 9/1/14 to 8/31/17

Funding

  • Department of Energy (DE-SC0012531)
