COOLR: A New System for Dynamic Thermal-Aware Computing

Project: Research project

Description

Lead Institution: Argonne National Laboratory
PI: Pete Beckman, beckman@mcs.anl.gov, (630) 252-9020
Co-PIs and Senior Personnel:
Northwestern University: Seda Ogrenci-Memik, Gokhan Memik
Argonne National Laboratory: Kazutomo Yoshii
Abstract
Extreme-scale systems need to walk a fine line between the amount of cooling they receive and
the thermal-induced performance and reliability degradation they can sustain. System managers
are extremely motivated to battle cooling-energy cost: the largest line item in their total operating
cost. On the other hand, pushing the nodes to peak performance in tightened cooling regimes
places the burden on dynamic thermal management (DTM) to protect the hardware. DTM
schemes throttle performance to relieve heat accumulation within nodes when cooling cannot mitigate
the problem. These interventions prevent fatal failures but introduce inevitable performance
degradations and variations. Such variations can have drastic consequences on the performance of
extreme-scale systems. Furthermore, the distribution of heat across system nodes depends on more
than the amount of workload. Even if a strictly equal share of computational load is assigned to
all nodes, there are physical attributes and topological features of each system that inherently cause
uneven accumulation of heat. These can cause one subcomponent to trigger DTM prematurely and
penalize the overall system.
Our goal in this project is to create a holistic thermal-aware view of the system, capturing both
inherent physical attributes and dynamic system state. We propose to develop COOLR, a dynamic
system with the ability to evaluate the interplay between management of computation, data, power
dissipation, and thermal state. Thereby, COOLR will achieve higher overall performance at the
same energy and cooling cost. Specific objectives of this proposal can be categorized into two
main directions. First, we will perform systematic power and thermal modeling of high-performance
computing architectures. Novel thermal instrumentation techniques will be investigated
as part of this effort. Also, analytical and empirical techniques will be blended to generate
a light-weight thermal model of target systems. Second, we will develop a thermal-aware OS and
runtime system. Policies for managing resources and schedules for processes and memory hierarchy
will be designed to co-optimize the thermal state and performance of the system.
The main outcome of our proposed thermal-aware dynamic computing system will be scheduling
and allocation of processes and memory accesses leading to an appropriately “skewed” distribution
of activity across the system. The ultimate result of this new arrangement will be reduced
occurrences of hotspots and reduced episodes of DTM intervention, leading to improvements in
performance. The overall thermal characterization and modeling methodology (thermal instrumentation
techniques, model building, and systematic model reduction) resulting from this project can
be applied to systems beyond our immediate scope and will be applicable to future scaling. The
thermal-aware dynamic computing paradigm will impact the efficiency of extreme-scale systems.
The gained “thermal slack” can be given back to further tighten the cooling budget, benefiting the
management cost of next-generation extreme-scale systems.
Status: Finished
Effective start/end date: 9/1/14 to 8/31/17

Funding

  • Department of Energy (DE-SC0012531)

Fingerprint

Cooling
Hot Temperature
Data storage equipment
Costs
Dynamical systems
Personnel
Hardware
Degradation