[Figure: big data diagram. Machine learning and data analytics enhance the scientific process.]

A Converging Path for Simulation, Machine Learning, and Big Data

Monday, April 18, 2016

The marriage of experimental science with simulating and modeling natural phenomena using high-performance computing (HPC) has been a fruitful one. Experimentation provides raw data for simulating a natural process and supports or disproves a working hypothesis. Simulation, in turn, shows scientists how well they understand what they are studying: the closer a simulation tracks observations of the real thing, the sounder the underlying model.

The feedback between HPC-based simulation and experimentation moves science forward faster than experimentation alone, eliminating unlikely hypotheses and pointing in more rewarding directions. Livermore researchers are now exploring a new twist on simulation-based discovery. Two currents in computation, machine learning and data analytics (the latter popularly called “big data”), are poised to transform this way of doing science.

Since its beginning, Livermore has been a leader in applying HPC both to its national security missions, chief among them nuclear stockpile stewardship, and to basic scientific questions about how the universe works. However, the data sets researchers work with have grown so large that they are looking to the convergence of simulation with machine learning (computer algorithms that learn and make predictions from data) and data analytics (the search for patterns and correlations in extremely large, sometimes highly unstructured, data sets) to usher in the next advance in how they make scientific discoveries.

Frederick Streitz, chief computational scientist and director of Livermore’s High Performance Computing Innovation Center, says, “The magic [of this convergence] is the two sides working together within one architecture to solve big scientific problems. The space is uncharted.”

A familiar frustration in HPC-based simulation is a code crash, which occurs when the underlying numerics break down. In a hydrodynamic simulation, turbulence and other complex physical phenomena can distort the computational mesh until the calculation halts. Jim Brase, Computation’s deputy associate director for data science, explains that HPC simulations often use a mesh mapped onto the object or phenomenon under study. “The mesh moves with the object, for example, by rotating, and it can actually become entangled—mesh lines can cross and cause the simulation to crash.”
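
To make mesh entanglement concrete, the toy check below (a hypothetical illustration, not Laboratory code) flags a two-dimensional quadrilateral zone as tangled when its corners no longer turn consistently counterclockwise, which is exactly what happens when mesh lines cross:

```python
import numpy as np

def corner_turns(quad):
    """quad: (4, 2) array of corner coordinates, counterclockwise.
    Returns the cross product of the two edges meeting at each corner;
    a non-positive value means the cell has folded over itself."""
    turns = []
    for i in range(4):
        p_prev, p, p_next = quad[i - 1], quad[i], quad[(i + 1) % 4]
        e1 = p - p_prev   # edge arriving at the corner
        e2 = p_next - p   # edge leaving the corner
        turns.append(e1[0] * e2[1] - e1[1] * e2[0])
    return np.array(turns)

def is_tangled(quad):
    return bool(np.any(corner_turns(quad) <= 0.0))

good = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
bad = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)  # "bowtie" cell
print(is_tangled(good), is_tangled(bad))  # False True
```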

Typically, the research team adjusts the simulation mesh at the point where the crash took place, but restarting this way wastes valuable developer and computer processing time, so scientists would like to know in advance where a crash might happen. A Laboratory Directed Research and Development project led by Livermore researcher Ming Jiang seeks to remedy this issue. The team is using machine learning algorithms to predict mesh entanglement, and the approach is showing success.
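
In rough outline, such a predictor could be trained on mesh-quality metrics gathered from past runs and then used to score zones in a live run. Every feature, label, and threshold below is a hypothetical stand-in, not the team’s actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: rows are mesh zones, columns are quality
# metrics (say, aspect ratio, skewness, and minimum Jacobian), with labels
# marking zones that went on to tangle in earlier simulations.
X_train = rng.normal(size=(5000, 3))
y_train = (X_train[:, 2] < -1.5).astype(int)  # stand-in failure rule

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# During a new run, score the current zones each cycle and warn (or
# trigger a remesh) wherever the predicted tangling risk is high.
X_now = rng.normal(size=(10, 3))
risk = model.predict_proba(X_now)[:, 1]
print(np.flatnonzero(risk > 0.5))  # indices of zones to watch or relax
```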

“If [machine learning] does nothing more, that’s already a win. But we can do better,” says Streitz. He speculates that it should be possible to develop an algorithm that predicts where a crash will take place and then applies a fix while the simulation is in progress.
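
One caricature of such an in-flight fix, assuming simple Laplacian smoothing in place of the far more careful remeshing a production hydrodynamics code would use, might look like this:

```python
import numpy as np

def relax_flagged_nodes(coords, neighbors, flagged, alpha=0.5):
    """coords: (N, 2) node positions; neighbors: dict mapping a node index
    to the indices of its adjacent nodes; flagged: nodes predicted to be
    at risk. Moves each flagged node partway toward the average of its
    neighbors (simple Laplacian smoothing)."""
    relaxed = coords.copy()
    for n in flagged:
        average = coords[neighbors[n]].mean(axis=0)
        relaxed[n] = (1.0 - alpha) * coords[n] + alpha * average
    return relaxed

# Tiny usage example: node 1 has drifted toward its neighbor, node 0.
coords = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0]])
neighbors = {1: [0, 2]}
print(relax_flagged_nodes(coords, neighbors, flagged=[1]))
```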

Another project is under way to improve cybersecurity. According to Brase, Livermore researchers are creating maps of structural and functional elements of computer networks, inserting those maps into simulations, and asking machine learning algorithms to answer “what-if” questions about vulnerabilities in the network.
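
The sketch below, on a made-up four-host network and with the machine learning layer omitted, illustrates only the structural side of such a “what-if” question: remove one host and ask what can still be reached.

```python
from collections import deque

# Hypothetical network map: host -> directly connected hosts.
network = {
    "workstation": ["switch"],
    "switch": ["workstation", "firewall"],
    "firewall": ["switch", "database"],
    "database": ["firewall"],
}

def reachable_from(graph, start, compromised=None):
    """Breadth-first search that skips a compromised (removed) host."""
    seen, queue = {start}, deque([start])
    while queue:
        host = queue.popleft()
        for nxt in graph.get(host, []):
            if nxt not in seen and nxt != compromised:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# What if the firewall is taken down? The database becomes unreachable,
# revealing a single point of failure in this map.
print("database" in reachable_from(network, "workstation",
                                   compromised="firewall"))  # False
```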

Machine learning-based analytics of very large data sets could uncover correlations that are too complex for humans to pick out. From physics to molecular biology, difficulties in analyzing very large data sets, for example, those describing genes and large proteins, have stymied progress. “We know how to model proteins,” says Streitz. “Suppose we know a protein is involved in a signaling pathway. We know its end state, but not the entire pathway. We can propose a pathway and do experiments to figure out which one is correct. However, these are time-intensive and expensive.”

Streitz suggests another approach: apply a machine learning algorithm to the data set and find correlations between variables. Correlation does not imply causation, so each correlation is treated as a hypothesis rather than an answer. Posit that the correlation reflects the cause of the observation, in this case, that a particular gene is responsible for the signaling pathway. Then let the machine learning code set up and run a simulation that tests this hypothesis.
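
Schematically, the loop might read as follows. Everything here is hypothetical: simple pairwise correlation stands in for the machine learning step, and the confirming simulation is left as a placeholder:

```python
import numpy as np

def rank_correlations(data):
    """Rank variable pairs by absolute Pearson correlation, a simple
    stand-in for much richer machine learning analytics."""
    corr = np.corrcoef(data, rowvar=False)
    n = corr.shape[0]
    pairs = [(abs(corr[i, j]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    return sorted(pairs, reverse=True)

# Stand-in data set: 200 observations of 5 variables.
data = np.random.default_rng(1).normal(size=(200, 5))

for strength, i, j in rank_correlations(data)[:3]:
    # Correlation is not causation: each candidate is only a hypothesis
    # until a simulation confirms or rejects it.
    print(f"hypothesis: variable {i} drives variable {j} (|r|={strength:.2f})")
    # run_confirming_simulation(i, j)  # hypothetical simulation step
```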

A project just getting under way will use this approach to examine how the Ras protein family causes various cancers. The work is a collaboration among Livermore, the National Cancer Institute’s Frederick National Laboratory, and Los Alamos, Argonne, and Oak Ridge national laboratories.

Says Streitz, “This approach (merging machine learning, data analytics, and simulation) could enable us to do automated hypothesis generation on a grand scale. It could potentially change how we do scientific research.”

Almost any field could benefit from adding machine learning to simulation. Applied to measurements within an industrial production facility, for example, it could optimize and regulate production dynamically in real time.

The machine learning–simulation convergence demonstrates that Livermore’s national security and scientific discovery missions are locked in a positive feedback loop: computational methods devised to carry out the national security missions help answer basic science questions, and those answers, in turn, drive the development of new science and technology for the missions.