Bringing the next advanced technology system, Sierra, to Livermore
On November 14, 2014, Secretary of Energy Ernest Moniz announced that a partnership involving IBM, NVIDIA, and Mellanox was chosen to design and develop systems for Lawrence Livermore (LLNL) and Oak Ridge (ORNL) national laboratories. The LLNL system, Sierra, is the next advanced technology system sited at LLNL in the Advanced Simulation and Computing (ASC) Program’s system line, which has included Blue Pacific, White, Purple, BlueGene/L, and Sequoia. As the latest advanced technology system, Sierra addresses the most demanding computing problems that the ASC Program and its stockpile stewardship mission face. To achieve this goal, the system must provide the largest capability available to ASC applications and incorporate novel technologies that foreshadow the future directions of the Department of Energy’s (DOE’s) large-scale systems on the path to exascale computing.
The partnership’s design for Sierra calls for IBM Power architecture processors connected by NVLink to NVIDIA Volta graphics processing units (GPUs). NVLink is an interconnect bus that provides higher performance than the traditional Peripheral Component Interconnect Express (PCIe) bus for attaching hardware devices in a computer, allowing coherent direct access to GPU and memory. The machine is connected with a Mellanox InfiniBand network using a fat-tree topology, a versatile network design that can be tailored to work efficiently with the bandwidth available. Sierra is expected to be at least seven times more powerful than LLNL’s current advanced technology system, Sequoia.
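To make the fat-tree idea concrete, the following is a minimal sketch, assuming a textbook three-level fat tree built from identical k-port switches. It does not describe Sierra's actual network configuration, which this article does not specify. The function names fat_tree_counts and taper_ratio are hypothetical, chosen only for illustration. The second function shows one way a fat tree can be "tailored" to the bandwidth available: giving an edge switch fewer uplinks than host-facing ports trades bandwidth for cost.

```python
def fat_tree_counts(k: int) -> dict:
    """Switch and host counts for a textbook three-level k-ary fat tree.

    Assumption: every switch has k ports, and half of each edge and
    aggregation switch's ports face downward. This is an illustrative
    model, not the configuration of any particular machine.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,                        # one pod per core-switch port
        "edge_switches": k * half,        # k/2 edge switches per pod
        "aggregation_switches": k * half, # k/2 aggregation switches per pod
        "core_switches": half * half,     # (k/2)^2 core switches
        "hosts": k * half * half,         # k/2 hosts per edge switch
    }


def taper_ratio(down_ports: int, up_ports: int) -> float:
    """Ratio of host-facing ports to uplink ports on an edge switch.

    A value of 1.0 means full bandwidth toward the core; larger values
    mean the uplinks are oversubscribed (a "tapered" fat tree).
    """
    if down_ports <= 0 or up_ports <= 0:
        raise ValueError("port counts must be positive")
    return down_ports / up_ports


if __name__ == "__main__":
    print(fat_tree_counts(4))
    print(taper_ratio(24, 12))
```

In this model, the host count grows with the cube of the switch port count, which is why fat trees scale to large machines using commodity switches.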
Sierra is part of the CORAL (Collaboration of Oak Ridge, Argonne, and Livermore) procurement, a first-of-its-kind collaboration among ORNL, Argonne, and LLNL that culminated in three pre-exascale high-performance computing (HPC) systems delivered in 2017. CORAL was established by DOE to leverage supercomputing investments, streamline procurement processes, and reduce the costs of developing supercomputers.
“Our collaborative goal was to choose two systems that, as a set, offer the best overall value to DOE. We need diversity of technologies and vendors, as well as systems that provide value to the DOE laboratories,” says Bronis de Supinski, chief technology officer for Livermore Computing (LLNL’s supercomputing center). “Diversity helps to offset risk and ensure that future systems will continue to meet our evolving needs.”
The Argonne and ORNL systems help meet the future mission needs of the Advanced Scientific Computing Research program within the DOE’s Office of Science, while Sierra serves the mission needs of the ASC Program within the National Nuclear Security Administration. The ORNL system, called Summit, has the same architecture as Sierra, which demonstrates the synergies between the missions of the two parts of DOE.
Once contracts were awarded, Nonrecurring Engineering (NRE) work maximized the impact and utility of the resulting LLNL and ORNL systems. NRE covers nonrecurring expenses paid to the vendors for design and engineering milestones specific to Sierra. This separate contract preceded the “build” contract and funded accelerated or modified development to enhance the usability or effectiveness of the final system. The NRE contract provided significant benefits by creating a Center of Excellence (CoE) that fosters interaction between laboratory domain scientists and vendor experts as actual applications are ported and optimized for the new architecture. The NRE contract also supported exploration of motherboard design and cooling concepts; GPU reliability, file system performance, and open-source compiler infrastructure; and advanced system diagnostics and scheduling, along with advanced networking capabilities.
Several working groups that brought together the three laboratories and the IBM partnership were formed to ensure that the Sierra and Summit systems meet DOE requirements. These working groups addressed the programming environment, node design, and various other topics that ensured the usability and performance of the final systems. The CoE functioned, in effect, as one of the working groups. Working group discussions often turned into a co-design process to meet goals. Co-design draws on the combined expertise of vendor experts, including hardware architects and system software developers, and laboratory experts, such as domain scientists, computer scientists, and applied mathematicians, who work together to make informed decisions about hardware and software components.
A small early-access system delivered in 2016 had an earlier generation of the IBM Power processor architecture, NVIDIA Pascal GPUs, and a version of NVLink. This early-access system supports interactions on several critical topics, such as development of an effective compiler infrastructure. “It is a complete precursor system, so we can explore the capabilities and deploy some early software systems on the machine,” says de Supinski.