PROTEUS: Machine Learning Driven Resilience for Extreme-scale Systems

Project: Research project



Alok Choudhary, Northwestern University (Principal Investigator)
Ankit Agrawal, Northwestern University (Co-Investigator)
Wei-keng Liao, Northwestern University (Co-Investigator)


The challenges of extreme-scale computing systems span multiple dimensions, including architecture, energy constraints, memory scaling, limited I/O, and the scalability of software and applications. In extreme-scale computing, the failure rates of components and systems are likely to increase significantly. The larger the simulation, the greater the need for effective resiliency within the constraints posed by limited storage space, deeper and more complex memory hierarchies, and the higher cost of data movement (particularly external to the system) in terms of performance and energy. In the HPC domain, application-level checkpointing has historically been widely used to address resiliency: the state of a simulation is periodically stored in files as a defensive mechanism. The costs of such checkpointing include lower system utilization, significant data movement to I/O devices and storage systems, a more complex software development process, and increased power consumption due to data movement and storage. For extreme-scale systems these costs can become prohibitive, so checkpoint frequency may need to decrease even as failure rates increase.
As HPC systems scale, the amount of data produced by applications will increase dramatically. Consequently, the traditional model of storing raw, uncompressed data as a checkpoint will become cost-prohibitive. At the same time, it will remain necessary to store simulation states for post-analysis. Existing lossy compression can help somewhat in reducing data sizes, but its errors are not easy to bound, and data fidelity cannot be guaranteed while still achieving effective reductions in data size. For large-scale applications, significant deviation from the original values can affect the outcome of the simulation. Thus brute-force solutions for resiliency and checkpointing, particularly those that do not account for the constraints posed by extreme-scale systems, are unlikely to succeed.
In this proposal we pose the following challenging question: “Can we develop techniques that provide improved resilience via checkpoints while reducing the amount of data to be stored by an order of magnitude, guaranteeing a user-specified tolerable maximum error rate for each data point and an order of magnitude smaller mean error for the entire data set, and reducing I/O time by an order of magnitude, thereby enabling faster restart and faster convergence after restart, while providing data for effective analysis and visualization as an additional benefit?”
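
To make the per-point error guarantee in the question above concrete, the following Python sketch stores only quantized changes between two consecutive checkpoints. It is an illustration under stated assumptions, not the project's method: uniform quantization stands in for a learned representation of change distributions, and the function names (`encode_changes`, `decode_changes`), the tolerance `eps`, and the synthetic data are all hypothetical.

```python
import math
import random


def encode_changes(prev, curr, max_abs_error):
    """Quantize point-wise changes (curr - prev) into integer bin indices.

    Bins are 2 * max_abs_error wide, so rounding to the nearest bin centre
    keeps each reconstructed value within max_abs_error of the original
    (up to floating-point rounding).
    """
    if max_abs_error <= 0:
        raise ValueError("max_abs_error must be positive")
    width = 2.0 * max_abs_error
    return [round((c - p) / width) for p, c in zip(prev, curr)]


def decode_changes(prev, indices, max_abs_error):
    """Rebuild an approximation of the current state from prev and bin indices."""
    width = 2.0 * max_abs_error
    return [p + i * width for p, i in zip(prev, indices)]


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical simulation state at two consecutive checkpoints.
    prev = [math.sin(0.01 * k) for k in range(1000)]
    curr = [v + random.gauss(0.0, 0.001) for v in prev]

    eps = 1e-3  # hypothetical user-specified point-wise tolerance
    idx = encode_changes(prev, curr, eps)
    approx = decode_changes(prev, idx, eps)

    worst = max(abs(a - c) for a, c in zip(approx, curr))
    print(f"distinct bin indices: {len(set(idx))}")
    print(f"max point-wise error: {worst:.2e} (tolerance {eps:.0e})")
```

When changes between nearby time steps are small relative to the tolerance, most indices fall into a few bins, so the index stream compresses well with a standard lossless coder. In a chain of checkpoints, each step would need to be encoded against the reconstructed previous state rather than the exact one, so that the per-point bound holds after restart instead of accumulating.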

The overarching goal of this proposal is to design, develop, and evaluate scalable algorithms, software, and libraries that enable enhanced resilience, efficient checkpointing, and program restart, along with the ability to dual-use the data for detailed analysis. The proposed tasks are to 1) design and develop scalable machine-learning-based techniques that learn temporal change patterns in situ, minimizing data movement and maximizing learning locally, closest to the data; 2) design a concise data representation and indexing mechanism to capture the distribution of changes in data that can guarantee point-wise user-defined tolerable errors while reducing the data storage requirements by an order of ma
Effective start/end date: 9/1/18 → 8/31/21


  • Department of Energy (DE-SC0019358//18SC503797)


Learning systems
Data storage equipment
Software engineering
Electric power utilization