StarSapphire: Data-driven Modeling and Analysis

StarSapphire is a collection of projects in the area of scientific data mining focusing on the analysis of data from scientific simulations, observations, and experiments. StarSapphire is a follow-on to the Sapphire scientific data mining project, where we conducted research in algorithms, incorporated this research into software, and applied the software to real-world problems, which, in turn, motivated our research. Our experiences showed that we could use techniques from data mining, machine learning, image and video processing, statistics, and pattern recognition to improve the way in which scientists extract useful information from data.

In the StarSapphire projects, we are leveraging these earlier experiences to address recent challenges in data-driven modeling and analysis. These challenges arise from newer types of data, such as data streams; larger volumes of data, such as those resulting from three-dimensional simulations of complex phenomena; and new constraints on the analysis, such as the need for in-situ analysis in exascale systems or real-time analysis for anomaly detection.

Despite these new challenges, our approach to analysis remains the same as the one we developed and used in Sapphire (shown in Figure 1). This approach worked very well in the analysis of data from a variety of problems in many different domains. We found that it was important to consider scientific data mining as an iterative and interactive process, involving data pre-processing, search for patterns, knowledge evaluation, and possible refinement of the process based on input from domain experts or feedback from one of the steps. As the pre-processing of the data is a time-consuming, but critical, first step in data mining, we include it as an integral part of the process. The pre-processing is often domain and problem dependent; however, several techniques developed in the context of one problem or domain can be applied to other problems and domains as well. The pattern recognition step is usually independent of the domain or problem.

Figure 1. Our end-to-end approach to data analysis.
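
The iterative process described above can be illustrated with a minimal sketch. It is not the Sapphire software: the function names, the z-score thresholding used as a stand-in for pattern recognition, and the acceptance rule are all assumptions chosen only to show how pre-processing, pattern search, evaluation, and refinement might feed into one another.

```python
from statistics import mean, stdev


def preprocess(raw):
    """Pre-processing (hypothetical): drop missing values, then rescale
    to zero mean and unit sample standard deviation."""
    values = [v for v in raw if v is not None]
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def find_patterns(features, threshold):
    """Pattern search (hypothetical): return the indices, into the cleaned
    data, of points whose absolute z-score exceeds the threshold."""
    return [i for i, z in enumerate(features) if abs(z) > threshold]


def evaluate(flagged, n_points, max_fraction):
    """Knowledge evaluation (hypothetical): accept the result only if some
    points were flagged and they form a small fraction of the data."""
    return 0 < len(flagged) <= max_fraction * n_points


def analyze(raw, threshold=3.0, max_fraction=0.2, step=0.25, max_iters=10):
    """Run the loop: search, evaluate, and refine the threshold using
    feedback from the evaluation step, for at most max_iters rounds."""
    features = preprocess(raw)
    flagged = []
    for _ in range(max_iters):
        flagged = find_patterns(features, threshold)
        if evaluate(flagged, len(features), max_fraction):
            break
        # Refinement: tighten if too many points were flagged,
        # loosen if none were.
        threshold += step if flagged else -step
    return flagged, threshold
```

In practice, the refinement step would more likely draw on input from domain experts than on a fixed rule; the automatic threshold adjustment here only stands in for that feedback.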


As part of StarSapphire, we are involved in the following projects:

  • Poincaré plots: Classification and characterization of orbits
  • Blob tracking: Analysis of coherent structures in NSTX images
  • GSEP Analysis: Analysis of fluid and particle data from GSEP simulations
  • WindSENSE: Managing the integration of wind energy on the power grid
  • SensorStreams: Real-time analysis of streaming data from sensors
  • MINDES: Data mining for inverse design
  • Exa-DM: Enabling scientific discovery in exascale simulations



We gratefully acknowledge the following funding sources (in alphabetical order) for supporting our work:

Why StarSapphire?

A star sapphire is a type of sapphire that exhibits a star-like phenomenon called asterism due to the presence of titanium dioxide impurities. The star effect results when light reflects from the needle-like inclusions of the impurities aligned perpendicular to the rays of the star. There were several options for naming the follow-on project to Sapphire. A programming viewpoint might have called it “Sapphire++”, a statistical viewpoint would have resulted in “Sapphire-PLUS”, and an image understanding approach would have led to “Sapphire Junior”. However, given the data mining focus, it seemed appropriate to name the collection of projects after another gemstone; hence the choice of StarSapphire.