RAJA: Managing Application Portability for Next-Generation Platforms
Advanced technology (AT) system node architectures are becoming more complex and diverse as hardware vendors strive to deliver performance gains within constraints such as power usage. This makes transforming codes so they can run efficiently on multiple platforms increasingly time consuming and difficult. The challenges are particularly acute for Advanced Simulation and Computing (ASC) Program multiphysics codes, which are essential tools for Livermore’s nuclear stockpile stewardship mission. A typical large integrated physics code contains millions of lines of source code and tens of thousands of loops, in which a wide range of complex numerical operations are performed. Variations in hardware and parallel programming models make it increasingly difficult to achieve high performance without disruptive platform-specific changes to application software.
To address this challenge, computer scientists Richard Hornung and Jeff Keasler are developing RAJA, a software abstraction that systematically encapsulates platform-specific code to enable applications to be portable across diverse hardware architectures without major source code disruption. RAJA is designed to integrate with existing codes and provide a development model for new codes to be portable from inception. Basic insertion of RAJA enables a code to run on different platforms. Then, architecture-specific tunings can be pursued within the RAJA layer without substantial application code disruption.
The fundamental conceptual abstraction in RAJA is an inner loop, where the overwhelming majority of computational work in most physics codes occurs. The main idea promoted by RAJA is to separate loop bodies from their iteration patterns so that execution details, such as the choice of programming model and the order in which data are traversed, are encapsulated in one place.
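The separation of a loop body from its iteration pattern can be illustrated with a minimal C++ sketch. This is not RAJA's actual interface: the names `seq_policy`, `reverse_policy`, `forall`, and `daxpy` are hypothetical and chosen only for illustration. The loop body is passed as a lambda, and a policy tag selects how the index range is traversed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical execution-policy tags (illustrative names, not RAJA's types).
struct seq_policy {};      // traverse indices in increasing order
struct reverse_policy {};  // traverse indices in decreasing order

// Iteration pattern for seq_policy: a plain forward loop.
template <typename Body>
void forall(seq_policy, std::size_t begin, std::size_t end, Body&& body) {
    for (std::size_t i = begin; i < end; ++i) {
        body(i);
    }
}

// Iteration pattern for reverse_policy: same range, opposite order.
// A threaded or GPU policy would be one more overload of this function.
template <typename Body>
void forall(reverse_policy, std::size_t begin, std::size_t end, Body&& body) {
    for (std::size_t i = end; i > begin; --i) {
        body(i - 1);
    }
}

// Application kernel: the loop body is written once and never mentions
// how the iteration is carried out.
template <typename Policy>
void daxpy(Policy policy, double a,
           const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    forall(policy, 0, y.size(), [&](std::size_t i) { y[i] += a * x[i]; });
}
```

Calling `daxpy(seq_policy{}, ...)` or `daxpy(reverse_policy{}, ...)` runs the same loop body under a different traversal. In this sketch, a platform-specific tuning would be confined to the `forall` overloads rather than spread through application code.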
Hydrodynamics packages in ARES and KULL were used to evaluate basic RAJA usage. LULESH, a proxy for the Lagrange hydrodynamics algorithm in ALE3D, was used to demonstrate RAJA flexibility and more advanced concepts. Hornung explains, “These codes use very different software constructs, mesh types, and methodologies for looping over mesh-based data. Parts of our evaluation used the RAJA reference implementation while others required specialization of RAJA concepts for specific code situations. Such customization is a fundamental design goal of RAJA.”
Nearly all loops in the ARES Lagrangian hydro package were converted to RAJA. Nominal runs on the Blue Gene/Q platform (Livermore’s current Advanced Technology system, Sequoia) yielded a 50% speedup by introducing four-way OpenMP inner loop threading. KULL performance was studied on a loop-by-loop basis; some loops saw no benefit while others saw close to perfect four-fold speedup. Such speedups are due to hardware threading, which is not typically used in ASC’s message-passing interface–only codes due to memory limitations. Thus, these gains are a pure performance win over the status quo.
This work demonstrated key aspects of RAJA. It is sufficiently flexible to support Livermore production codes; it can enhance readability and maintainability, in contrast to programming models that can “clutter” code, such as directive-based models; and it can enable easy exploration of different programming models and data layouts (for example, it simplifies switching between OpenMP and CUDA and reordering loops and data).
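The data-layout point can be sketched in the same spirit. The example below assumes a hypothetical two-dimensional view class, `View2D`, parameterized by a layout policy (`RowMajor` or `ColMajor`). None of these names come from RAJA. They only illustrate how the ordering of data in memory might be changed without touching the loop body that uses it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies: each maps a logical (i, j) pair to a flat offset.
struct RowMajor {
    static std::size_t index(std::size_t i, std::size_t j,
                             std::size_t /*ni*/, std::size_t nj) {
        return i * nj + j;  // j varies fastest in memory
    }
};
struct ColMajor {
    static std::size_t index(std::size_t i, std::size_t j,
                             std::size_t ni, std::size_t /*nj*/) {
        return j * ni + i;  // i varies fastest in memory
    }
};

// Hypothetical 2D view that owns its storage and hides the layout choice.
template <typename Layout>
class View2D {
public:
    View2D(std::size_t ni, std::size_t nj)
        : ni_(ni), nj_(nj), data_(ni * nj, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) {
        assert(i < ni_ && j < nj_);
        return data_[Layout::index(i, j, ni_, nj_)];
    }

    std::size_t rows() const { return ni_; }
    std::size_t cols() const { return nj_; }
    const std::vector<double>& raw() const { return data_; }

private:
    std::size_t ni_;
    std::size_t nj_;
    std::vector<double> data_;
};

// The loop body uses only logical indices, so it is identical for both layouts.
template <typename Layout>
void fill(View2D<Layout>& v) {
    for (std::size_t i = 0; i < v.rows(); ++i) {
        for (std::size_t j = 0; j < v.cols(); ++j) {
            v(i, j) = 10.0 * static_cast<double>(i) + static_cast<double>(j);
        }
    }
}
```

Switching a `View2D<RowMajor>` to a `View2D<ColMajor>` changes where each value lives in memory, while `fill` and any other code written against `v(i, j)` stays the same.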
Based on these investigations, several avenues of future work will be pursued, including continued RAJA development and exploration of portability issues related to future AT platforms, such as Trinity (Intel Many Integrated Core) and Sierra (GPU). Compiler features and optimization issues required for RAJA are being addressed with vendors through Trinity and Sierra “centers of excellence” and other co-design activities. Several Livermore production codes are also actively moving toward adopting RAJA.