Listed here are some of the many research projects and topics underway in Computation. Projects vary in size, scope, and duration, but what they share is a focus on developing tools and methods that help LLNL deliver on its missions to the nation and, more broadly, advance the state of the art in high-performance scientific computing.
Application-level resilience is emerging as an alternative to traditional fault-tolerance approaches because it provides fault tolerance at a lower cost. LLNL researchers are implementing application-level resilience in ddcMD, which can now reincorporate lost data into its workload and continue executing in the presence of most errors without restarting the entire application.
AutomaDeD: Diagnosing Performance and Correctness Faults
AutomaDeD is a tool that automatically diagnoses performance and correctness faults in MPI applications. It has two major functionalities: identifying abnormal MPI tasks and code regions and finding the least-progressed task. The tool produces a ranking of MPI processes by their abnormality degree and specifies the regions of code where faults are first manifested.
BLAST: High-Order Finite Element Hydrodynamics
Through research funded at LLNL, scientists have developed BLAST, a high-order finite element hydrodynamics research code that improves the accuracy of simulations, provides a path to extreme parallel computing and exascale architectures, and gives a high performance computing advantage since its greater FLOP/byte ratios result in more time spent on floating point operations relative to memory transfer.
Caliper: Application Introspection System
A comprehensive understanding of the performance behavior of large-scale simulations requires the ability to compile, analyze, and compare measurements and contexts from many independent sources. Caliper, a general-purpose application introspection system, makes that task easier by connecting various independent context annotations, measurement services, and data processing services.
The Department of Energy (DOE) has a long history of deploying leading-edge computing capability for science and national security.
Cram: Running Millions of Concurrent MPI Jobs
Cram lets you easily run many small MPI jobs within a single, large MPI job by splitting MPI_COMM_WORLD up into many small communicators to run each job in the cram file independently. A job comprises the pieces needed to run a parallel MPI program. Cram was created to allow automated test suites to pack more jobs into a BG/Q partition, and to run large ensembles on systems where the scheduler will not scale.
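The partitioning idea described above can be sketched in a few lines. This is a minimal illustration in plain Python, not Cram's implementation: the function name `assign_jobs` and its arguments are hypothetical, and no MPI library is used. It only computes which job each rank of the large job would belong to and what that rank's number would be inside the job's own communicator.

```python
from bisect import bisect_right
from itertools import accumulate


def assign_jobs(world_size, job_sizes):
    """Map each world rank to (job_index, rank_within_job).

    Hypothetical helper for illustration only; not part of Cram.
    job_sizes[i] is the number of processes requested by job i.
    Ranks that are not needed by any job map to (None, None).
    """
    if sum(job_sizes) > world_size:
        raise ValueError("jobs need more processes than the world provides")
    # ends[i] is one past the last world rank that belongs to job i.
    ends = list(accumulate(job_sizes))
    mapping = []
    for rank in range(world_size):
        job = bisect_right(ends, rank)
        if job == len(job_sizes):
            mapping.append((None, None))  # idle rank, no job assigned
        else:
            start = ends[job] - job_sizes[job]
            mapping.append((job, rank - start))
    return mapping


if __name__ == "__main__":
    for rank, (job, local) in enumerate(assign_jobs(8, [2, 3, 2])):
        print(rank, job, local)
```

In an actual MPI program, the job index would typically be passed as the color and the local rank as the key to `MPI_Comm_split`, with idle ranks passing `MPI_UNDEFINED` as the color so that they receive no communicator.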
Data-Intensive Computing Solutions
New platforms are improving big data computing on Livermore’s high performance computers.
Derived Field Generation Execution Strategies
Livermore computer scientists have helped create a flexible framework that aids programmers in creating source code that can be used effectively on multiple hardware architectures.
Enhancing Image Processing Methods
Researchers are developing enhanced computed tomography image processing methods for explosives identification and other national security applications.
ESGF: Supporting Climate Research Collaboration
The Earth System Grid Federation is a web-based tool set that powers most global climate change research.
ETHOS: Enabling Technologies for High-Order Simulations
The Enabling Technologies for High-Order Simulations (ETHOS) project performs research of fundamental mathematical technologies for next-generation high-order simulations.
ExReDi: Extreme Resilient Discretization
Because of the end of Dennard scaling, computing capability is increasing through more processing units, not faster clock
FGFS: Fast Global File Status
Fast Global File Status (FGFS) is an open-source package that provides scalable mechanisms and programming interfaces to retrieve global information of a file, including its degree of distribution or replication and consistency. It turns expensive, non-scalable file system calls into simple string comparison operations. Most FGFS file status queries complete in 272 milliseconds or faster at 32,768 MPI processes, with the most expensive operation clocking in at less than 7 seconds.
Flux: Building a Framework for Resource Management
Livermore researchers are developing a toolset for solving data center bottlenecks.
GLVis: Finite Element Visualization
GLVis is a lightweight OpenGL-based tool for accurate and flexible finite element visualization. It is based on MFEM, a finite element library developed at LLNL. GLVis provides interactive visualizations of general finite element meshes and solutions, both in serial and in parallel. It encodes a large amount of parallel finite element domain-specific knowledge; e.g., it allows the user to view parallel meshes as one piece, but it also gives them the ability to isolate each component and observe it individually. It provides support for arbitrary high-order and NURBS meshes (NURBS allow more accurate geometric representation) and accepts multiple socket connections so that the user may have multiple fully-functional visualizations open at one time. GLVis can also run a batch sequence, or a series of commands, which gives the user precise control over visualizations and enables them to easily generate animations.
GREMLINs: Emulating Exascale Conditions on Today's Platforms
To overcome the shortcomings of the analytical and architectural approaches to performance modeling and evaluation, we are developing techniques that emulate the behavior of anticipated future architectures on current machines. We are implementing our emulation approaches in what we call the GREMLIN framework. Using GREMLIN, we can emulate a combined effect of power limitations and reduced memory bandwidth and then measure the impact of the GREMLIN modifications.
High-order Finite Volume Methods
High-resolution finite volume methods are being developed for solving problems in complex phase space geometries, motivated by kinetic models of fusion plasmas. Techniques being investigated include conservative, high-order methods based on the method of lines for hyperbolic problems, as well as coupling to implicit solvers for field equations. Mapped multiblock grids enable alignment of the grid coordinate directions to accommodate strong anisotropy. The algorithms developed will be broadly applicable to systems of equations with conservative formulations in mapped geometries.
HPC Code Performance: Challenges and Solutions
LLNL researchers are finding some factors are more important in determining HPC application performance than traditionally thought.
HYPRE: Scalable Linear Solvers and Multigrid Methods
Livermore’s hypre library of solvers makes larger, more detailed simulations possible by solving problems faster than ever before. It offers one of the most comprehensive suites of scalable parallel linear solvers available for large-scale scientific simulation.
InfiniBand: Improving Communications for Large-scale Computing
Livermore Computing staff is enhancing the high-speed InfiniBand data network used in many of its high-performance computing and file systems.
Message passing can reduce throughput for massively parallel science simulation codes by 30% or more due to contention with other jobs for the network links. We investigated potential causes of performance variability. Reducing this variability could improve overall throughput at a computer center and save energy costs.
LibRom: POD-based Reduced Order Modeling
LibRom is a library designed to facilitate Proper Orthogonal Decomposition (POD) based Reduced Order Modeling (ROM). In POD
LMAT: Livermore Metagenomics Analysis Toolkit
The Livermore Metagenomics Analysis Toolkit (LMAT) is a genome sequencing technology that helps accelerate the comparison of genetic fragments with reference genomes and improve the accuracy of the results as compared to previous technologies. It tracks approximately 25 billion short sequences and is currently being evaluated for potential operational use in global biosurveillance and microbial forensics by various federal agencies.
Machine Learning: Strengthening Performance Predictions
LLNL computer scientists are using machine learning to model and characterize the performance and ultimately accelerate the development of adaptive applications.
Master Block List: Protecting Against Cyber Threats
Master Block List is a service and data aggregation tool that aids Department of Energy facilities in creating filters and blocks to prevent cyber attacks.
Mathematical Techniques for Data Mining Analysis
Newly developed mathematical techniques reveal important tools for data mining analysis.
The advent of many-core processors with a greatly reduced amount of per-core memory has shifted the bottleneck in computing from FLOPs to memory. A new, complex memory/storage hierarchy is emerging, with persistent memories offering greatly expanded capacity, and augmented by DRAM/SRAM cache and scratchpads to mitigate latency. As shown above, non-volatile random access memory (NVRAM), Resistive RAM (RRAM), or Phase Change Memory (PCM) may be memory or I/O bus attached, and may utilize DRAM buffers to improve latency and reduce wear.
Our research program focuses on transforming the memory-storage interface with three complementary approaches:
* Active memory and storage in which processing is shared between CPU and in-memory/storage controllers,
* Efficient software cache and scratchpad management, enabling memory-mapped access to large, local persistent stores,
* Algorithms and applications that provide a latency-tolerant, throughput-driven, massively concurrent computation model.
MFEM: Scalable Finite Element Discretization Library
Livermore’s open-source MFEM library enables application scientists to quickly prototype parallel physics application codes based on partial differential equations (PDEs) discretized with high-order finite elements. The MFEM library is designed to be lightweight, general and highly scalable, and conceptually can be viewed as a finite element toolkit that provides the building blocks for developing finite element algorithms in a manner similar to that of MATLAB for linear algebra methods. It has a number of unique features, including: support for arbitrary order finite element meshes and spaces with both conforming and nonconforming adaptive mesh refinement; advanced finite element spaces and discretizations, such as mixed methods, DG (discontinuous Galerkin), DPG (discontinuous Petrov-Galerkin) and Isogeometric Analysis (IGA) on NURBS (Non-Uniform Rational B-Splines) meshes; and native support for the high-performance Algebraic Multigrid (AMG) preconditioners from the HYPRE library.
MPI_T: Tools for MPI 3.0
MPI_T is an interface for tools introduced in the 3.0 version of MPI. The interface provides mechanisms for tools to access and set performance and control variables that are exposed by an MPI implementation. The latest versions of major MPI implementations are already providing MPI_T functionality, making it widely accessible to users. We have developed a set of MPI_T tools, Gyan and VarList, to help tool writers with the new interface.
Network Modeling and Simulation
To ensure that the supercomputing power at our disposal is not wasted, we must ascertain that our applications can run at their peak performance; the amount of communication in an application will be the primary determinant of performance at those scales. Fast, scalable, and accurate modeling/simulation of an application’s communication is required to prepare parallel applications for exascale.
O(N) First Principles Molecular Dynamics
LLNL researchers are developing a truly scalable first-principles molecular dynamics algorithm with O(N) complexity and controllable accuracy, capable of simulating systems of sizes that were previously impossible with this degree of accuracy. By avoiding global communications, a practical computational scheme capable of extreme scalability has been implemented.
PAVE: Performance Analysis and Visualization at Exascale
Performance analysis of parallel scientific codes is becoming increasingly difficult, and existing tools fall short in revealing the root causes of performance problems. We have developed the HAC model, which allows us to directly compare the data across domains and use data visualization and analysis tools available in other domains.
PDES: Modeling Complex, Asynchronous Systems
PDES focuses on models that can accurately and effectively simulate California’s large-scale electric grid.
Phase Field Modeling
Livermore researchers have developed an algorithm for the numerical solution of a phase-field model of microstructure evolution in polycrystalline materials. The system of equations includes a local order parameter, a quaternion representation of local orientation and species composition. The approach is based on a Finite Volume discretization and an implicit time-stepping algorithm. Recent developments have been focused on modeling solidification in binary alloys, coupled with CALPHAD methodology.
Modern processors offer a wide range of control and measurement features that are traditionally accessed through libraries like PAPI. However, some newer features no longer follow the traditional model of counters, and all of these features are controlled through Model Specific Registers (MSRs). libMSR provides a convenient interface to access MSRs and to allow tools to utilize their full functionality.
Predictive Vascular Modeling
Livermore researchers are enhancing HARVEY, an open-source parallel fluid dynamics application designed to model blood flow in patient-specific geometries. Researchers will use HARVEY to achieve a better understanding of vascular diseases as well as cancer cell movement through the bloodstream. Establishment of a robust research platform could have direct impact on patient care. HARVEY is also an enabling capability for the BAASiC initiative.
PSUADE: Uncertainty Quantification
The growth of high-performance supercomputing technology and advances in numerical techniques have resulted in the emergence of the uncertainty quantification (UQ) discipline, whose goal is to enable scientists to make precise statements about the degree of confidence they have in their simulation-based predictions. Uncertainty quantification is defined as the identification, characterization, propagation, analysis, and reduction of all uncertainties in simulation models.
P^nMPI: Low-overhead Wrapper Library
PMPI is a success story for HPC tools, but it has a number of shortcomings. LLNL researchers aimed to virtualize the PMPI interface, enable dynamic linking of multiple PMPI tools, create extensions for modularity, reuse existing binary PMPI tools, and allow dynamic tool chain selection. The result is PnMPI, a thin, low-overhead wrapper library that is automatically generated from the mpi.h file and that can be linked by default.
Qbox: Computing Electronic Structures at the Quantum Level
LLNL’s version of Qbox, a first-principles molecular dynamics code, will let researchers accurately calculate bigger systems on supercomputers.
RAJA: Managing Application Portability for Next-Generation Platforms
A Livermore-developed programming approach helps software to run on different platforms without major disruption to the source code.
ROSE, an open-source project maintained by Livermore researchers, provides easy access to complex, automated compiler technology and assistance.
SAMRAI: Structured Adaptive Mesh Refinement Application Infrastructure
The Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory is developing algorithms and software technology to enable the application of structured adaptive mesh refinement (SAMR) to large-scale multi-physics problems relevant to U.S. Department of Energy programs. The SAMRAI (Structured Adaptive Mesh Refinement Application Infrastructure) library is the code base in CASC for exploring application, numerical, parallel computing, and software issues associated with SAMR.
Scalable Quantum Molecular Dynamics Simulations
LLNL researchers are developing a new algorithm for use with first-principles molecular dynamics (FPMD) codes that will enable the number of atoms simulated to be proportional to the number of processors available; with traditional algorithms, the size of simulations is much too small to model complex systems or realistic materials. The researchers have achieved excellent scaling on 100,000 cores of Vulcan with 100,000 atoms at a rate of about 4 minutes per time step.
Scaling Up Transport Sweep Algorithms
LLNL researchers are testing and enhancing a neutral particle transport code and the algorithm on which the code relies to ensure that they successfully scale to larger and more complex computing systems.
SCR: Scalable Checkpoint/Restart for MPI
To evaluate the multilevel checkpoint approach in a large-scale, production system context, LLNL researchers developed the Scalable Checkpoint/Restart (SCR) library. With SCR, we have found that jobs run more efficiently, recover more work upon failure, and reduce load on critical shared resources. Research efforts now focus on reducing the overhead of writing checkpoints even further.
Serpentine Wave Propagation
The Serpentine project develops advanced finite difference methods for solving hyperbolic wave propagation problems. Our approach is based on solving the governing equations in second order differential formulation using difference operators that satisfy the summation by parts (SBP) principle. The SBP property of our finite difference operators guarantees stability of the scheme in an energy norm.
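As a small illustration of the summation-by-parts property mentioned above, the sketch below builds the classic second-order SBP first-derivative operator D = H⁻¹Q on a uniform grid and checks that Q + Qᵀ = diag(-1, 0, ..., 0, 1), the discrete counterpart of integration by parts. It is a textbook example in plain Python, not code from the Serpentine project, and the function names are chosen here for illustration.

```python
def sbp_operators(n, h):
    """Return (H, D) as dense lists of lists for n grid points, spacing h.

    H is the diagonal norm (trapezoidal-rule quadrature weights).
    D uses central differences inside and one-sided differences at
    the two boundary points.
    """
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        H[i][i] = h if 0 < i < n - 1 else h / 2
    D = [[0.0] * n for _ in range(n)]
    D[0][0], D[0][1] = -1 / h, 1 / h
    D[n - 1][n - 2], D[n - 1][n - 1] = -1 / h, 1 / h
    for i in range(1, n - 1):
        D[i][i - 1], D[i][i + 1] = -0.5 / h, 0.5 / h
    return H, D


def matmul(A, B):
    """Dense matrix product of two lists of lists."""
    m, p = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(m)) for j in range(p)]
            for row in A]


def sbp_defect(n, h):
    """Largest entry of |Q + Q^T - B|, with Q = H D and B = diag(-1, 0, ..., 0, 1)."""
    H, D = sbp_operators(n, h)
    Q = matmul(H, D)
    B = [[0.0] * n for _ in range(n)]
    B[0][0], B[n - 1][n - 1] = -1.0, 1.0
    return max(abs(Q[i][j] + Q[j][i] - B[i][j])
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    print(sbp_defect(11, 0.1))  # zero up to rounding
```

Because Q + Qᵀ contains only boundary terms, the discrete identity uᵀHDv = u_N v_N - u_0 v_0 - (Du)ᵀHv holds for any grid functions u and v, which is what makes energy estimates carry over from the continuous problem to the scheme.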
Spack: A Flexible Package Manager for HPC Software
High-performance computing (HPC) software is becoming increasingly complex, quickly outpacing the capabilities of existing software management tools.
Spindle: Scalable Shared Library Loading
Spindle is a tool for improving the library-loading performance of dynamically linked HPC applications. It plugs into the system’s dynamic linker and intercepts its file operations so that only one process (or a small number of processes) performs the necessary file operations and shares the results with the other processes in the job.
StarSapphire: Data-driven Modeling and Analysis
StarSapphire is a collection of projects in the area of scientific data mining focusing on the analysis of data from scientific simulations, observations, and experiments.
STAT: Discovering Supercomputers' Code Errors
LLNL’s Stack Trace Analysis Tool helps users quickly identify errors in code running on today’s largest machines.
SUNDIALS: SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers
SUNDIALS is a SUite of Nonlinear and DIfferential/ALgebraic equation Solvers. It consists of the following six solvers: CVODE, which solves initial value problems for ordinary differential equation (ODE) systems; CVODES, which solves ODE systems and includes sensitivity analysis capabilities (forward and adjoint); ARKODE, which solves initial value ODE problems with additive Runge-Kutta methods, including support for IMEX methods; IDA, which solves initial value problems for differential-algebraic equation (DAE) systems; IDAS, which solves DAE systems and includes sensitivity analysis capabilities (forward and adjoint); and KINSOL, which solves nonlinear algebraic systems.
As processors have become faster over the years, the cost of communicating data has grown higher. It is imperative to maximize data locality and minimize data movement on-node and off-node. Using profiling tools, we can characterize different classes of applications and use specialized profilers to measure specific phases of an application in detail. We can also predict the performance benefits of intelligently mapping applications by combining a variety of network and system measurements.
TESSA: Tracking Space Debris
Testbed Environment for Space Situational Awareness software helps to track satellites and space debris and prevent collisions.
Topological Analysis: Charting Data’s Peaks and Valleys
LLNL and University of Utah researchers have developed an advanced, intuitive method for analyzing and visualizing complex data sets.
TOSS: Speeding Up Commodity Cluster Computing
Researchers have been developing a standardized and optimized operating system and software for deployment across a series of Linux clusters to enable high-performance computing at a reduced cost.
Veritas: Validating Proxy Apps
Veritas provides a method for validating proxy applications to ensure that they capture the intended characteristics of their parents. Previously, the validation process has been done mostly by manually matching algorithmic steps in proxy applications to the parent or by relying on the experience of the code developer. Veritas can identify and compare performance sinks in areas such as memory, cache utilization, and network utilization.
XBraid: Parallel Time Integration with Multigrid
The scalable multigrid reduction in time (MGRIT) approach was developed by LLNL researchers in response to a bottleneck of traditional sequential time-marching algorithms caused by stagnant clock speeds. It constructs coarse time grids and uses each coarse time scale solution to improve the next finer-scale solution, ultimately yielding an iterative scheme that simultaneously updates in parallel a solution guess over the entire space-time domain.
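The coarse-to-fine correction idea can be illustrated with a two-level parareal iteration, which is commonly described as a two-level relative of MGRIT. The sketch below is plain Python for the scalar test problem y' = λy with backward Euler on both levels. It is not XBraid's API or implementation, and all names and parameter values are assumptions made for the example.

```python
def parareal(lam, y0, T, n_coarse, m_fine, iterations):
    """Two-level parareal for y' = lam * y on [0, T].

    Returns the solution guess at the n_coarse + 1 coarse time points.
    """
    dT = T / n_coarse   # coarse step
    dt = dT / m_fine    # fine step

    def coarse(y):
        # One backward Euler step of size dT.
        return y / (1.0 - lam * dT)

    def fine(y):
        # m_fine backward Euler steps of size dt across one coarse interval.
        for _ in range(m_fine):
            y = y / (1.0 - lam * dt)
        return y

    # Initial guess: a cheap sequential sweep on the coarse grid.
    U = [y0]
    for n in range(n_coarse):
        U.append(coarse(U[n]))

    for _ in range(iterations):
        # Fine solves on each interval are independent of one another,
        # so this is the part that could run in parallel.
        F = [fine(U[n]) for n in range(n_coarse)]
        G_old = [coarse(U[n]) for n in range(n_coarse)]
        # Sequential coarse sweep that corrects the whole time domain.
        U_new = [y0]
        for n in range(n_coarse):
            U_new.append(coarse(U_new[n]) + F[n] - G_old[n])
        U = U_new
    return U


def sequential_fine(lam, y0, T, n_coarse, m_fine):
    """Reference: ordinary time marching with the fine step."""
    dt = T / (n_coarse * m_fine)
    U = [y0]
    for _ in range(n_coarse):
        y = U[-1]
        for _ in range(m_fine):
            y = y / (1.0 - lam * dt)
        U.append(y)
    return U


if __name__ == "__main__":
    ref = sequential_fine(-1.0, 1.0, 2.0, 10, 20)
    for k in (0, 1, 2, 3):
        U = parareal(-1.0, 1.0, 2.0, 10, 20, k)
        print(k, max(abs(a - b) for a, b in zip(U, ref)))
```

Each iteration updates the guess at every time point at once. The iteration reproduces the sequential fine-grid solution after at most as many iterations as there are coarse intervals, and in practice it is stopped much earlier once the change between iterations falls below a tolerance.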
zfp & fpzip: Floating Point Compression
zfp is an open source C++ library for compressed floating-point arrays that support very high throughput read and write random access. It was designed to achieve high compression ratios and therefore uses lossy but optionally error-bounded compression. fpzip is a library for lossless or lossy compression of 2D or 3D floating-point scalar fields. It was primarily designed for lossless compression.
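To make the phrase "lossy but error-bounded" concrete, the sketch below uses plain uniform quantization, which guarantees that every reconstructed value lies within a chosen absolute tolerance of the original. This is deliberately not the algorithm used by zfp or fpzip and does not use their APIs. The function names are hypothetical and serve only to illustrate what an error bound means.

```python
def compress(values, tolerance):
    """Quantize each value to an integer code (illustrative only).

    With a bin width of 2 * tolerance, rounding to the nearest bin
    keeps the reconstruction error within the tolerance, up to
    floating-point rounding.
    """
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    step = 2.0 * tolerance
    return [round(v / step) for v in values]


def decompress(codes, tolerance):
    """Reconstruct approximate values from the integer codes."""
    step = 2.0 * tolerance
    return [c * step for c in codes]


if __name__ == "__main__":
    data = [0.0, 0.1234, -3.75, 2500.0, 1e-5]
    tol = 0.01
    restored = decompress(compress(data, tol), tol)
    print(max(abs(a - b) for a, b in zip(data, restored)))
```

The integer codes are not smaller by themselves. In a real compressor the size reduction comes from encoding such codes compactly, and a larger tolerance yields fewer distinct codes and therefore a higher compression ratio.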
ZFS: Improving Lustre Efficiency
Livermore computer scientists are incorporating the Zettabyte File System into their high-performance parallel file systems for better performance and scalability.