Predicting node failure in high performance computing systems from failure and usage logs

Nithin Nakka*, Ankit Agrawal, Alok Nidhi Choudhary

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

20 Scopus citations

Abstract

In this paper, we apply data mining classification schemes to predict failures in a high performance computer system. Failure and Usage data logs collected on supercomputing clusters at Los Alamos National Laboratory (LANL) were used to extract instances of failure information. For each failure instance, past and future failure information is accumulated - time of usage, system idle time, time of unavailability, time since last failure, time to next failure. We performed two separate analyses, with and without classifying the failures based on their root cause. Based on this data, we applied some popular decision tree classifiers to predict if a failure would occur within 1 hour. Our experiments show that our prediction system predicts failures with a high-degree of precision up to 73% and recall of about 80%. We also observed that employing the usage data along with the failure data has improved the accuracy of prediction.

Original languageEnglish (US)
Title of host publication2011 IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum, IPDPSW 2011
Pages1557-1566
Number of pages10
DOIs
StatePublished - Dec 20 2011
Event25th IEEE International Parallel and Distributed Processing Symposium, Workshops and Phd Forum, IPDPSW 2011 - Anchorage, AK, United States
Duration: May 16 2011May 20 2011

Publication series

NameIEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

Other

Other25th IEEE International Parallel and Distributed Processing Symposium, Workshops and Phd Forum, IPDPSW 2011
CountryUnited States
CityAnchorage, AK
Period5/16/115/20/11

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Software
  • Theoretical Computer Science

Fingerprint Dive into the research topics of 'Predicting node failure in high performance computing systems from failure and usage logs'. Together they form a unique fingerprint.

Cite this