As global, broad-based climate change projections have become more useful, managing the vast volumes of accompanying data has become a major challenge for the computational scientists who support those projections. Understanding and predicting climate change and extreme weather events requires advanced tools to securely store, manage, access, analyze, visualize, and process enormous, distributed data sets. This “big data” challenge is being met with the Earth System Grid Federation (ESGF), an international collaboration led by LLNL. Designed and maintained by dozens of American, European, Asian, and Australian research institutions, ESGF now powers most global climate change research, notably assessments by the Intergovernmental Panel on Climate Change.
ESGF combines grid-based computing with a distributed architecture, keeping participating members sovereign while simultaneously linking them together. To achieve that, ESGF developers created a unique system of nodes that requires very little explicit coordination while still providing a robust “data space” for storage and computation. The newest iteration of ESGF offers an immense, computerized climate database that standardizes and organizes observational and simulation data from 21 countries, allowing scientists to compare models against actual observations. A rich set of climate analysis tools is available to help manipulate the data.
ESGF allows teams to work in highly distributed research environments, using unique scientific instruments, exascale-class computers, and extreme amounts of data. Users can access ESGF data using Web browsers, scripts, and client applications. A key to ESGF’s success is its ability to effectively produce, validate, and analyze research results collaboratively, so that, for example, new results generated by one team member are immediately accessible to the rest of the team, who can annotate, comment on, and otherwise interact with those results.
The ESGF peer-to-peer architecture is based on a dynamic system of nodes—independently administered yet united by common protocols and interfaces—that interact on an equal basis and offer a broad range of user and data services, depending on how each is set up. Data are published, stored, and served from dozens of nodes around the globe, yet they are searchable and accessible as if they were stored in a single global archive. Metadata shared among projects help fully integrate the repository of data and components for usability and interoperability. ESGF also promotes standard conventions for data transformation, quality control, and data validation across processes and projects.
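The federation idea described above, in which independently administered nodes answer queries as if they formed a single archive, can be illustrated with a small in-memory sketch. This is not ESGF code and does not reflect its actual protocols. The `Node` class, the `federate` helper, and the facet names (`variable`, `model`) are hypothetical and chosen only for illustration. Each node keeps its own records, and a search entered at any node fans out to its peers and merges the answers.

```python
class Node:
    """Hypothetical stand-in for one independently administered data node."""

    def __init__(self, name):
        self.name = name
        self.records = []   # metadata for datasets published at this node
        self.peers = []     # other nodes this node knows about

    def publish(self, dataset_id, **metadata):
        # Data stay on the publishing node; only metadata is recorded here.
        self.records.append({"id": dataset_id, "node": self.name, **metadata})

    def local_search(self, **facets):
        # Match records whose metadata equals every requested facet value.
        return [
            r for r in self.records
            if all(r.get(key) == value for key, value in facets.items())
        ]

    def federated_search(self, **facets):
        # Query this node and every peer, then merge by dataset id so the
        # caller sees one combined result list.
        merged = {}
        for node in [self, *self.peers]:
            for record in node.local_search(**facets):
                merged.setdefault(record["id"], record)
        return sorted(merged.values(), key=lambda r: r["id"])


def federate(nodes):
    """Make every node a peer of every other node (no central coordinator)."""
    for node in nodes:
        node.peers = [other for other in nodes if other is not node]


if __name__ == "__main__":
    a, b, c = Node("node-a"), Node("node-b"), Node("node-c")
    federate([a, b, c])
    a.publish("ds1", variable="tas", model="m1")
    b.publish("ds2", variable="tas", model="m2")
    c.publish("ds3", variable="pr", model="m1")
    # The same query gives the same answer from any entry point.
    for hit in c.federated_search(variable="tas"):
        print(hit["id"], "served from", hit["node"])
```

In this toy model every node is an equal entry point, which mirrors the peer-to-peer claim in the text. A real federation would also need authentication, replication, and handling of unreachable peers, none of which is modeled here.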
ESGF is designed to remain robust even as data volumes continue to grow exponentially. Currently, 25,000 users (researchers and nonresearchers) from 2,700 sites on six continents share data through ESGF. Members of the climate community have downloaded more than 2 petabytes of data through ESGF, making it one of the most complex and successful big data systems in existence.
For more information, contact Dean N. Williams.
Dealing with Data Overload in the Scientific Realm, Science & Technology Review, January 2013