Research Topics in Scientific Data Mining

The Sapphire project is developing scalable algorithms for the interactive exploration of large, complex, multi-dimensional scientific data. We are applying and extending ideas from data mining, image and video processing, statistics, and pattern recognition to improve the way scientists extract useful information from data. Our work is done in the context of data analysis problems that arise in data from observations, simulations, and experiments. To address the challenges of applying data analysis techniques to massive and complex data sets, we are focusing on the following research areas:
  • Image processing techniques for denoising, object identification, and feature extraction
  • Dimension reduction techniques to handle multi-dimensional data
  • Scalable algorithms for classification and clustering
  • Parallel implementations for interactive exploration of data
  • Applied statistics to ensure that the conclusions drawn from the data are statistically sound
We focus on research in algorithms, incorporation of this research into software, and the application of the software to real-world problems at LLNL and elsewhere. The needs of these applications, in turn, drive our research. Sapphire research has resulted in the following patents:
  • Chandrika Kamath, Erick Cantu-Paz, "Parallel Object-Oriented Data Mining System," U.S. Patent 6,675,164 B2, January 6, 2004.
  • Chandrika Kamath, Erick Cantu-Paz, David Littau, "Using Histograms to Introduce Randomization in the Generation of Ensembles of Decision Trees," U.S. Patent 6,859,804 B2, February 22, 2005.
  • Chandrika Kamath, Chuck H. Baldwin, Imola K. Fodor, Nu A. Tang, "Parallel Object-Oriented, Denoising System Using Wavelet Multiresolution Analysis," U.S. Patent 6,879,729 B2, April 12, 2005.
  • Chandrika Kamath, Erick Cantu-Paz, "Creating Ensembles of Decision Trees through Sampling," U.S. Patent 6,938,049 B2, August 30, 2005.
  • Chandrika Kamath, Erick Cantu-Paz, "Parallel Object-Oriented Decision Tree System," U.S. Patent 7,007,035 B2, February 28, 2006.
  • Erick Cantu-Paz, Chandrika Kamath, "Creating Ensembles of Oblique Decision Trees with Evolutionary Algorithms and Sampling," U.S. Patent 7,062,504 B2, June 13, 2006.
R&D 100 Award: The Sapphire team (Erick Cantu-Paz, Samson Cheung, Abel Gezahegne, Cyrus Harrison, Chandrika Kamath, and Nu Ai Tang) received the 2006 R&D 100 Award for their work on the Sapphire scientific data mining software. [S&TR article]

More details on our research are available in our publications. The following pages summarize the key aspects of our work.

The Sapphire team would like to acknowledge the following funding sources (in alphabetical order), which made this research possible: the Advanced Simulation and Computing (ASC) Program through DOE, the Department of Homeland Security, the Laboratory-Directed Research and Development (LDRD) Program at LLNL, and the DOE Office of Science SciDAC Program.

For more technical information, contact Chandrika Kamath, (925) 423-3768.
UCRL-WEB-214348. These pages were last modified on March 4, 2009.