Fault Tolerance and Resilience

Resilience is a measure of the ability of a computing system and its applications to continue working in the presence of system degradations and failures. Resilience depends heavily on the number of components in the system and the reliability of the individual components. Components may be reliable in consumer applications that contain only a handful of devices and yet produce high aggregate failure rates in high-performance computing (HPC) systems, which might include millions of components. Based on current knowledge and observations of existing large systems, exascale systems are anticipated to experience various kinds of faults many times per day. In particular, evidence points to a rise in silent errors—faults that are never detected or only detected long after they have generated erroneous results.
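
The effect of scale on aggregate reliability can be made concrete with a back-of-the-envelope calculation. The short Python sketch below assumes independent, identically reliable components whose failures follow an exponential model; the 250-year component MTBF and the component counts are illustrative values, not measurements from any particular system.

    # Illustrative back-of-the-envelope calculation, assuming independent,
    # identically reliable components with exponentially distributed failures.
    HOURS_PER_YEAR = 24 * 365

    def system_mtbf_hours(component_mtbf_years, num_components):
        """System MTBF when any single component failure counts as a system fault."""
        component_rate = 1.0 / (component_mtbf_years * HOURS_PER_YEAR)  # failures per hour
        system_rate = component_rate * num_components                   # independent rates add
        return 1.0 / system_rate

    # A component MTBF of 250 years looks excellent across a handful of devices...
    print(system_mtbf_hours(250, 10))         # ~219,000 hours (about 25 years)
    # ...but across a million components, faults arrive many times per day.
    print(system_mtbf_hours(250, 1_000_000))  # ~2.2 hours between faults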

Most applications tolerate failures by periodically saving their state to checkpoint files on reliable storage. Upon failure, an application can restart from a prior state by reading in a checkpoint. Unfortunately, at extreme scales, the time needed for traditional checkpointing and restarting will exceed the mean time to failure of the full system. Instead, researchers must find new and more comprehensive strategies for maintaining system reliability and performance. To address these resilience challenges, LLNL computational scientists are developing better methods for detecting HPC faults and helping systems quickly recover from errors.
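
As a rough illustration of the checkpoint/restart pattern described above, the Python sketch below periodically writes application state to a file and resumes from the most recent checkpoint after a restart. The file name, checkpoint interval, and state layout are assumptions made for illustration only; they do not describe any particular LLNL tool, and production HPC codes rely on far more scalable checkpointing libraries.

    # Minimal application-level checkpoint/restart sketch (illustrative only).
    import os
    import pickle

    CHECKPOINT_FILE = "state.ckpt"   # hypothetical path on reliable storage
    CHECKPOINT_INTERVAL = 100        # steps between checkpoints (assumed value)

    def load_checkpoint():
        """Resume from the last saved state, or start fresh if none exists."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "result": 0.0}

    def save_checkpoint(state):
        """Write state to a temporary file, then rename it atomically."""
        tmp = CHECKPOINT_FILE + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT_FILE)

    state = load_checkpoint()
    while state["step"] < 10_000:
        state["result"] += 1.0       # stand-in for real computation
        state["step"] += 1
        if state["step"] % CHECKPOINT_INTERVAL == 0:
            save_checkpoint(state)   # bounds the rework lost to a crash

After a crash, rerunning the program picks up from the last saved step, so at most one checkpoint interval of work is repeated.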

Of Interest

Finding and Fixing a Supercomputer’s Faults

Livermore experts have developed innovative methods to detect hardware faults in supercomputers and help applications recover from errors that do occur.

Kathryn Mohror loves HPC, faults and all.

Kathryn Mohror develops tools that give researchers the information they need to tune their programs and maximize results. After all, says Kathryn, “It’s all about getting the answers more quickly.”