Achieving target MTTF by duplicating reliability-critical components in high performance computing systems

Nithin Nakka*, Alok Choudhary, Gary Grider, John Bent, James Nunez, Satsangat Khalsa

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Mean Time To failure, MTTF, is a commonly accepted metric for reliability. In this paper we present a novel approach to achieve the desired MTTF with minimum redundancy. We analyze the failure behavior of large scale systems using failure logs collected by Los Alamos National Laboratory. We analyze the root cause of failures and present a choice of specific hardware and software components to be made fault-tolerant, through duplication, to achieve target MTTF at minimum expense. Not all components show similar failure behavior in the systems. Our objective, therefore, was to arrive at an ordering of components to be incrementally selected for protection to achieve a target MTTF. We propose a model for MTTF for tolerating failures in a specific component, system-wide, and order components according to the coverage provided. Systems grouped based on hardware configuration showed similar improvements in MTTF when different components in them were targeted for fault-tolerance.

Original languageEnglish (US)
Title of host publication2011 IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum, IPDPSW 2011
Pages1567-1576
Number of pages10
DOIs
StatePublished - Dec 20 2011
Event25th IEEE International Parallel and Distributed Processing Symposium, Workshops and Phd Forum, IPDPSW 2011 - Anchorage, AK, United States
Duration: May 16 2011May 20 2011

Publication series

NameIEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

Other

Other25th IEEE International Parallel and Distributed Processing Symposium, Workshops and Phd Forum, IPDPSW 2011
Country/TerritoryUnited States
CityAnchorage, AK
Period5/16/115/20/11

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Software
  • Theoretical Computer Science

Fingerprint

Dive into the research topics of 'Achieving target MTTF by duplicating reliability-critical components in high performance computing systems'. Together they form a unique fingerprint.

Cite this