SCR: Scalable Checkpoint/Restart for MPI


The SCR team conducts research on several fronts related to HPC application checkpointing. We broadly classify these efforts into the following five categories.

Multilevel Checkpointing Research

Multilevel checkpointing greatly reduces checkpointing overhead by writing most checkpoints to fast but less resilient storage tiers and only a fraction of them to the parallel file system, and it has the potential to keep checkpoint/restart viable as we proceed toward exascale. The Scalable Checkpoint/Restart Library (SCR) is an example of a multilevel checkpointing system.
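To make the benefit concrete, here is a toy Python cost model. The costs and counts are illustrative assumptions, not SCR measurements: it only shows how sending most checkpoints to a fast local tier and a fraction to the parallel file system shrinks total checkpoint time.

```python
# Toy cost model for multilevel checkpointing.
# All numbers below are illustrative assumptions, not SCR measurements.

def checkpoint_overhead(n_checkpoints, cost_local, cost_pfs, pfs_every):
    """Total time spent checkpointing when only every `pfs_every`-th
    checkpoint goes to the parallel file system (PFS) and the rest
    stay in fast node-local storage."""
    n_pfs = n_checkpoints // pfs_every
    n_local = n_checkpoints - n_pfs
    return n_local * cost_local + n_pfs * cost_pfs

# Assumed costs: 5 s for a node-local checkpoint, 300 s for a PFS checkpoint.
single_level = checkpoint_overhead(100, 0, 300, 1)    # every checkpoint to PFS
multi_level = checkpoint_overhead(100, 5, 300, 10)    # only 1 in 10 to PFS

print(f"PFS-only overhead:   {single_level} s")  # 30000 s
print(f"multilevel overhead: {multi_level} s")   # 90*5 + 10*300 = 3450 s
```

Under these assumed costs, the multilevel schedule spends under an eighth of the time checkpointing, while still landing a durable copy on the PFS every tenth checkpoint.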

File System Research

Application checkpoints are typically stored on external parallel file systems, but limited bandwidth makes this a time-consuming operation. Multilevel checkpointing systems, like our Scalable Checkpoint/Restart (SCR) library, alleviate this bottleneck by caching checkpoints in storage located close to the compute nodes. However, most large-scale systems do not provide file storage on compute nodes, which prevents the use of SCR on those machines. It is therefore essential to understand the role file systems play in efficient checkpointing.

Checkpoint Compression Research

Typically, applications store their state in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for overloaded PFS resources. These high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. Our compression research explores the idea of aggregating and compressing checkpoints based on data semantics in order to reduce contention for shared PFS resources and ultimately make checkpoint-restart affordable for HPC applications.
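The aggregation idea can be illustrated with a small Python sketch. The checkpoint contents and layout are hypothetical, and zlib stands in for whatever compressor a real system would use: the point is only that grouping the same variable from all ranks places similar values contiguously before compression.

```python
import json
import zlib

# Hypothetical per-rank checkpoints: the same variables appear on every rank.
rank_checkpoints = [
    {"temperature": [300.0 + r] * 1000, "pressure": [101.3] * 1000}
    for r in range(8)
]

# Aggregate by variable name (the data semantics) so that similar values
# from all ranks are laid out contiguously, then compress the aggregate once.
aggregated = {
    var: [ckpt[var] for ckpt in rank_checkpoints]
    for var in rank_checkpoints[0]
}
raw = json.dumps(aggregated).encode()
compressed = zlib.compress(raw)

print(f"compression ratio: {len(compressed) / len(raw):.4f}")
```

One aggregated, compressed stream also means far fewer files and writers hitting the PFS than one checkpoint file per rank, which is the contention-reduction half of the argument.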

Asynchronous Checkpointing Research

Our asynchronous checkpointing system writes checkpoints through agents running on additional nodes, which asynchronously transfer checkpoints from compute nodes to a parallel file system (PFS). The approach has two key advantages: it lowers application checkpoint overhead by overlapping computation with writes to the PFS, and it reduces PFS load by using fewer concurrent writers and moderating the rate of PFS I/O operations.
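A minimal Python sketch of the overlap, under simplifying assumptions: a background thread plays the role of a transfer agent (in the real system the agent runs on a separate node), and directories stand in for node-local storage and the PFS. The application writes each checkpoint locally, hands it off via a queue, and immediately resumes computing while the agent drains the queue to the PFS.

```python
import os
import queue
import shutil
import tempfile
import threading

def transfer_agent(ckpt_queue, pfs_dir):
    """Drains queued checkpoint files to the PFS while the application
    keeps computing; a thread stands in for an agent on a staging node."""
    while True:
        path = ckpt_queue.get()
        if path is None:              # shutdown sentinel
            break
        shutil.copy(path, pfs_dir)    # slow PFS write, off the critical path

local_dir, pfs_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
ckpt_queue = queue.Queue()
agent = threading.Thread(target=transfer_agent, args=(ckpt_queue, pfs_dir))
agent.start()

for step in range(3):
    # ... compute for a while ...
    path = os.path.join(local_dir, f"ckpt_{step}")
    with open(path, "wb") as f:       # fast write to node-local storage
        f.write(b"application state")
    ckpt_queue.put(path)              # hand off; computation resumes immediately

ckpt_queue.put(None)                  # tell the agent to finish and exit
agent.join()
print("on PFS:", sorted(os.listdir(pfs_dir)))
```

Because a single agent serializes the transfers, the PFS sees one writer at a moderated rate rather than a burst from every compute node at once, mirroring the second advantage described above.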

I/O Scheduling Research

Scientific applications running on HPC systems can suffer from I/O bottlenecks during I/O-intensive operations such as checkpointing. Moreover, as multiple jobs share common I/O resources, including the parallel file system, they often suffer from inter-application I/O interference. In this research, we explore I/O scheduling techniques to mitigate these issues in present and future HPC storage systems.
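One simple form of I/O scheduling is admission control: cap how many jobs may be in an I/O burst at once so they do not all hammer the shared file system together. The Python sketch below is a toy illustration of that idea, not any particular scheduler, using a semaphore as the scheduler's slot allocator and a short sleep as a stand-in for a checkpoint write.

```python
import threading
import time

io_slots = threading.Semaphore(2)   # scheduler policy: at most 2 jobs doing I/O
active, peak = 0, 0
lock = threading.Lock()

def job_io_phase(duration=0.01):
    """One job's checkpoint burst: wait for an I/O slot, write, release."""
    global active, peak
    with io_slots:                   # blocks until the scheduler grants a slot
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(duration)         # stand-in for the actual checkpoint write
        with lock:
            active -= 1

# Eight jobs contend for the shared file system at the same time.
jobs = [threading.Thread(target=job_io_phase) for _ in range(8)]
for t in jobs:
    t.start()
for t in jobs:
    t.join()

print("peak concurrent writers:", peak)   # never exceeds the 2-slot policy
```

The tradeoff is the usual scheduling one: bounding concurrency delays some jobs' bursts, but each admitted burst sees far less interference than it would under a free-for-all.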