On November 14, 2014, Secretary of Energy Ernest Moniz announced that a partnership involving IBM, NVIDIA, and Mellanox was chosen to design and develop systems for Lawrence Livermore (LLNL) and Oak Ridge (ORNL) national laboratories. The LLNL system, Sierra, will be the next advanced technology system sited at LLNL in the Advanced Simulation and Computing (ASC) Program’s system line that has included Blue Pacific, White, Purple, BlueGene/L, and Sequoia. As the next advanced technology system, Sierra will be expected to address the most demanding computing problems that the ASC Program and its stockpile stewardship mission face. To achieve this goal, the system must provide the largest capability available to ASC applications and incorporate novel technologies that foreshadow the future directions of the Department of Energy’s (DOE’s) large-scale systems on the path to exascale computing.
The partnership’s design for Sierra uses IBM Power architecture processors connected by NVLink to NVIDIA Volta graphics processing units (GPUs). NVLink is an interconnect bus that provides higher performance than the traditional Peripheral Component Interconnect Express (PCIe) bus for attaching hardware devices in a computer, allowing coherent direct access to GPU and memory. The machine will be connected with a Mellanox InfiniBand network using a fat-tree topology, a versatile network design that can be tailored to work efficiently with the bandwidth available. Sierra is expected to be at least seven times more powerful than LLNL’s current advanced technology system, Sequoia.
Sierra is part of the CORAL procurement, a first-of-its-kind collaboration among ORNL, Argonne, and LLNL that will culminate in three pre-exascale high-performance computing (HPC) systems, to be delivered in the 2017 timeframe. CORAL was established by DOE to leverage supercomputing investments, to streamline procurement processes, and to reduce the costs to develop supercomputers.
“Our collaborative goal was to choose two systems that, as a set, offer the best overall value to DOE. We need diversity of technologies and vendors, as well as systems that will provide value to the DOE laboratories,” says Bronis de Supinski, chief technology officer for Livermore Computing (LLNL’s supercomputing center). “Diversity helps to offset risk and ensure that future systems will continue to meet our evolving needs.”
The Argonne and ORNL systems will help meet the future mission needs of the Advanced Scientific Computing Research program within the DOE’s Office of Science, while Sierra will serve the mission needs of the ASC Program within the National Nuclear Security Administration. The ORNL system, called Summit, will have the same architecture as Sierra, which demonstrates the synergies between the missions of the two parts of DOE.
Now that the contracts have been awarded to the IBM partnership, nonrecurring engineering (NRE) work to maximize the impact and utility of the resulting LLNL and ORNL systems has begun. NRE includes nonrecurring expenses paid to the vendors for design and engineering milestones specific to Sierra. This separate contract precedes the “build” contract and provides accelerated or modified development to enhance the usability or effectiveness of the final system. The NRE contract provides significant benefit by creating a Center of Excellence (CoE) that will foster interaction between laboratory domain scientists and vendor experts as actual applications are ported and optimized for the new architecture. The NRE contract will also support exploration of motherboard design and cooling concepts; GPU reliability, file system performance, and open-source compiler infrastructure; and advanced systems diagnostics and scheduling along with advanced networking capabilities.
Several working groups that bring together the three laboratories and the IBM partnership have been formed to ensure the future Sierra and Summit systems meet DOE requirements. These working groups are now hubs of activity, addressing the programming environment, node design, and various other topics that will ensure the usability and performance of the final systems. The CoE functions effectively as one of the working groups. Working group discussions often turn into a co-design process to meet goals. Co-design draws on the combined expertise of vendor experts, including hardware architects and system software developers, and laboratory experts, such as domain scientists, computer scientists, and applied mathematicians, who work together to make informed decisions about hardware and software components. Activities have also begun for the build contract, the first milestones of which are nearing completion.
A small, early-access system scheduled for delivery in 2016 will have an earlier generation of the IBM Power processor architecture, NVIDIA Pascal GPUs, and a version of NVLink. This early-access system will support interactions on several critical topics, such as development of an effective compiler infrastructure. “It will be a complete precursor system, so we can explore the capabilities and begin to deploy some early software systems on the machine,” says de Supinski.