Failure data-driven selective node-level duplication to improve MTTF in high performance computing systems

Nithin Nakka*, Alok Choudhary

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

2 Scopus citations

Abstract

This paper presents our analysis of the failure behavior of large-scale systems, using the failure logs collected by Los Alamos National Laboratory on 22 of its computing clusters. We note that not all nodes in these systems show similar failure behavior. Our objective, therefore, was to arrive at an ordering of nodes to be selected incrementally (one by one) for duplication, so as to achieve a target MTTF for the system while duplicating the fewest nodes. We arrived at a model for the fault coverage provided by duplicating each node and ordered the nodes according to the coverage each provides. Compared to the traditional approach of randomly choosing nodes for duplication, our model-driven approach provides improvements ranging from 82% to 1700%, depending on the targeted improvement in MTTF and the failure distribution of the nodes in the system.
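The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' coverage model, which the abstract does not specify. It assumes, purely for illustration, that nodes fail independently at constant rates, that the system fails when any non-duplicated node fails, and that duplicating a node removes its contribution to the system failure rate entirely. The function names and the failure rates below are hypothetical and are not taken from the LANL logs.

```python
import random


def system_mttf(rates, duplicated):
    """MTTF of a series system, assuming (hypothetically) that a
    duplicated node no longer contributes to the system failure rate."""
    residual = sum(r for i, r in enumerate(rates) if i not in duplicated)
    return float("inf") if residual == 0 else 1.0 / residual


def nodes_needed(rates, order, target_mttf):
    """Duplicate nodes one by one in `order` until the target MTTF is met;
    return how many nodes had to be duplicated."""
    duplicated = set()
    for node in order:
        if system_mttf(rates, duplicated) >= target_mttf:
            break
        duplicated.add(node)
    return len(duplicated)


# Hypothetical per-node failure rates (failures per hour), made up for this sketch.
rates = [0.010, 0.001, 0.001, 0.008, 0.001, 0.001, 0.001, 0.001]
target = 2.0 * system_mttf(rates, set())  # aim to double the baseline MTTF

# Data-driven order: nodes whose duplication covers the most failures come first.
ranked = sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)

# Baseline order: nodes picked at random (fixed seed for repeatability).
shuffled = list(range(len(rates)))
random.Random(0).shuffle(shuffled)

greedy_count = nodes_needed(rates, ranked, target)
random_count = nodes_needed(rates, shuffled, target)
print("ranked order:", greedy_count, "nodes; random order:", random_count, "nodes")
```

Under these assumptions the ranked order reaches the doubled MTTF after duplicating two nodes, and a random order can never need fewer, because removing the largest failure rates first shrinks the residual rate fastest. The size of the gap depends on how skewed the failure rates are, which is consistent with the abstract's remark that the improvement depends on the failure distribution of the nodes.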

Original language: English (US)
Title of host publication: High Performance Computing Systems and Applications - 23rd International Symposium, HPCS 2009, Revised Selected Papers
Pages: 304-322
Number of pages: 19
State: Published - May 21 2010
Event: 23rd International Symposium on High Performance Computing Systems and Applications, HPCS 2009 - Kingston, ON, Canada
Duration: Jun 14 2009 to Jun 17 2009

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5976 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 23rd International Symposium on High Performance Computing Systems and Applications, HPCS 2009
Country: Canada
City: Kingston, ON
Period: 6/14/09 to 6/17/09

Keywords

  • Duplication
  • Fault-tolerance
  • Mean time to failure
  • Partial duplication

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

