CCSM DATA MANAGEMENT PLAN (Aug 2005)
In Revision: November 2010
The Community Climate System Model (CCSM) Data Management Plan documents the procedures for the management of data associated with the CCSM project. These data policies and plans are the agreed-upon approaches, standards, and conventions that coordinate the production and post-processing of CCSM data across the CCSM working groups. The overall goal of CCSM data management is to give diverse users the best possible access to, and ease of use of, high-quality CCSM data within the constraints of available resources.
Some of the key elements of a CCSM Data Management Plan include:
1. Definition of the Categories of CCSM Data,
2. Ownership Rights and Responsibilities for CCSM Data,
3. The CCSM Data Release Timeline,
4. Stewardship of CCSM Data,
5. CCSM Data Format Standards,
6. CCSM Metadata Requirements,
7. CCSM Case and File Naming Conventions,
8. CCSM Data Quality Assurance Procedures,
9. Access to CCSM data,
10. Registering and Auditing the Characteristics of CCSM Data Users,
11. Distributed CCSM Data Repositories,
12. Distribution of CCSM Data Products to Non-NSF Entities, and
13. Changes to the CCSM Data Management Plan.
This plan supersedes any previous CCSM Data Management Plan and policies. These policies are not intended to be retroactive. The CCSM Data Management Plan is a living document that will evolve as the scope of the project and its customer base change over time.
The CCSM Data Management Policies and Plans
In accordance with the National Science Foundation (NSF) data policy, the CCSM project is committed to the timely submission for publication of results from the CCSM model runs and sharing of the scientific data generated in CCSM research activities. Open access to the CCSM data products is essential for validating model results and promoting scientific discovery in the climate research field, but the principal investigators (PIs) who have designed and performed an Experiment Simulation shall have the right to the first benefits to be derived from that run. These policies apply to all CCSM data created under the auspices of a CCSM Working Group.
1. Definition of Categories of CCSM Data
CCSM Control Runs are very long integrations (typically hundreds to thousands of model years), using a code base that corresponds to a control version of the CCSM code. Control Runs define the basic long-term climate of the CCSM. A control run needs to run long enough for slow adjustment processes, such as subsurface water in the land model, to come into balance. Some of these processes, such as the deep ocean heat and salinity, may take thousands of years to reach balance. The standard data output corresponds to monthly averages, with daily averages for a few select variables. Higher frequency data output for special analyses or to drive regional scale models may also be provided upon specific request, but is usually created for limited periods of the integration.
CCSM Experiment Simulations are long integrations (typically tens to hundreds of model years) usually made with modifications introduced into the control version of the CCSM to conduct a scientific experiment or policy scenario. A CCSM Experiment Simulation may be a single run or a series of runs. The modifications may be in the representations of the processes in the model, in the forcing data, or in both. Experiment Simulations are usually a few model centuries but may extend to millennia. Usually, a given experiment design will include the production of an ensemble of runs to provide an estimate of the range and significance of the model response to the changes. The ensemble of runs constitutes the data for the particular Experiment Simulations.
CCSM Evaluation Runs are short runs (typically model days to years) made to examine specific model behavior, such as response to changes in the representation of physical processes or boundary conditions, or to validate a port of the code to a new platform. A CCSM Evaluation Run is made under the direction of PIs or working groups.
CCSM Testing Runs are very short runs (typically model days to years) carried out to verify model functionality. Any working group can carry out testing runs.
These different categories of runs produce output data of various forms as defined in Appendix A. For the purposes of the CCSM Data Management Plan, the data produced by the categories of runs outlined above will be designated as Control, Experiment, Evaluation, and Testing.
2. Ownership Rights and Responsibilities for CCSM Data
CCSM output data created with CCSM resources are considered to be owned by either the Scientific Steering Committee (SSC) or the co-chairs of the CCSM working group that designed and carried out the runs. Data owners control how the data are checked for quality, when the data are to be released, how the data are to be documented, and who has access to the data.
CCSM Control Runs: The data from CCSM Control Runs are the property of the SSC and managed by the CCSM Project Office. CCSM Control Runs are documented on the CCSM Experiments and Output Data Web page at http://www.cesm.ucar.edu/experiments/. This documentation includes a description of the run and pointers to the validation plots and data files from the control run.
CCSM Experiment Simulations: The data from an Experiment Simulation are initially the property of the PIs or working groups who have designed and conducted the experiment. Transfer of access to a broader community takes place over time as defined by the data access policy (see 9. below). CCSM Experiment Simulations are documented on the CCSM Experiments and Output Data Web page at http://www.cesm.ucar.edu/experiments/. This documentation includes a description of the run and pointers to the validation plots and data files from the control run.
CCSM Evaluation Runs: The data from CCSM Evaluation Runs are the property of the PIs or working groups and are considered to be non-public, internal tools. CCSM Evaluation Runs are only required to be documented to the extent needed by the PI or working group.
CCSM Testing Runs: CCSM Testing Runs are treated in the same manner as CCSM Evaluation Runs.
CCSM data owners are responsible for ensuring that the data are documented and archived in accordance with the CCSM Data Release Timeline Policy (see 3. below). This includes documenting the simulation data on the Web, keeping the data on the designated archival storage device for the specified time, and deleting the data once they have become obsolete.
When multiple working groups collaborate on a single run, one working group should be designated as the primary owner of the data.
3. The CCSM Data Release Timeline
The intellectual investment and time committed to the design and execution of a CCSM Experiment Simulation entitles the PIs to the first benefits to be obtained from the resulting data. Publication of descriptive or interpretive results derived immediately and directly from the Experiment Simulation data is the privilege and responsibility of the PIs who perform the simulation. However, to further CCSM science objectives, CCSM PIs are encouraged to share their data with colleagues prior to the release deadlines. When the release deadlines for CCSM data are reached, the data move into the public domain.
Thus, the release status of CCSM data is characterized as being Protected, CCSM Access, or Public. Protected data are owned by the PIs or working groups that created them. CCSM Access data are Protected data that have been made available to all CCSM working group members. The Protected data become CCSM Access data, then Public data by permission of the initial owners or through the expiration of the proprietary time period as defined by the CCSM Data Release Timeline policy below. Public data are open to access by the public. Experiment, Evaluation, and Testing data are owned and managed by the working groups that created them. Control data are owned by the CCSM SSC and managed by the CCSM Project Office.
All CCSM data are initially categorized as Protected.
CCSM Control data become Public once the control run has been validated by the Chair of the SSC.
CCSM Experiment data shall be available to members of any CCSM Working Group no later than six (6) months following the conclusion of the Experiment Simulation.
CCSM Experiment data shall become Public as soon as a scientific paper on the results has been submitted by the PIs who originated the model run or one (1) year after the end of the simulation, whichever comes sooner.
Any scientist wishing to make use of Experiment data before this date should communicate directly with the PIs who performed the simulation about access to the data. In this circumstance, it is anticipated that the PIs will be co-authors of any published results, if they wish.
All Evaluation and Testing data are classed as Protected and remain so unless the PI or Working Group decides otherwise.
All information regarding CCSM Public data availability, including appropriate references and acknowledgments, will appear on the CCSM Experiments and Output Data Web page at http://www.cesm.ucar.edu/experiments.
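The release timeline in this section can be read as a small decision rule. The following Python sketch is one possible reading, not an official CCSM tool: the function names `add_months` and `release_status` are hypothetical, and treating "six months" and "one year" as calendar months counted from the end of the simulation is an assumption. It returns the latest status the timeline allows; owners may always release data earlier.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift d by whole calendar months, clamping the day to the month's end."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def release_status(category: str, today: date, run_end: date,
                   paper_submitted: Optional[date] = None,
                   validated: bool = False) -> str:
    """Return the default release status of a run's data on the date `today`.

    Assumption: deadlines are counted in calendar months from `run_end`.
    """
    if category == "Control":
        # Control data become Public once the Chair of the SSC validates the run.
        return "Public" if validated else "Protected"
    if category == "Experiment":
        # Public on paper submission or one year after the run, whichever is sooner.
        if paper_submitted is not None and paper_submitted <= today:
            return "Public"
        if today >= add_months(run_end, 12):
            return "Public"
        # Available to all CCSM working groups no later than six months after the run.
        if today >= add_months(run_end, 6):
            return "CCSM Access"
        return "Protected"
    if category in ("Evaluation", "Testing"):
        # These stay Protected unless the PI or working group decides otherwise.
        return "Protected"
    raise ValueError(f"unknown data category: {category!r}")


# A run that ended on 31 August 2005, checked on 1 March 2006.
print(release_status("Experiment", date(2006, 3, 1), date(2005, 8, 31)))  # CCSM Access
```

Early release by the owners is deliberately not modeled here; a real tracking system would need to record such decisions separately.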
4. Stewardship of CCSM Data
The CCSM data retention policy strikes a balance between the scientific need to retain data from older CCSM simulations and the growing cost of doing so in a resource-limited environment. Unlike observational data, model simulation data often become less valuable with time as better models of higher quality are developed and run. Nevertheless, publication of scientific analyses of CCSM Control Runs and Experiment Simulations continues for years after the data were generated. Accordingly, data from CCSM Control Runs and Experiment Simulations shall be preserved for specified time periods to allow extraction of the maximum scientific content.
CCSM data at NCAR will be retained under the guidelines of this data stewardship policy. The owners of the data are responsible for the stewardship. A similar policy for CCSM data held at other sites is encouraged. Sites holding CCSM Control or Experiment Simulation data that are Public should give the CCSM SSC the option of archiving these data at NCAR before deletion. A similar courtesy should be provided to the SSC for Public data that are slated for deletion.
Public Data from CCSM Control Runs and CCSM Experiment Simulations will be retained for a period of ten (10) years from the date of the end of the simulation. Duplicate copies of Public data owned by the SSC will be removed after five years or after the next public release, whichever comes first, unless the working group decides otherwise.
Retention periods for CCSM Evaluation and Testing data are left to the discretion of the PIs and the working groups. Should storage resources become an issue, the SSC reserves the right to intervene.
Develop a procedure for determining the cost of generating and maintaining CCSM data for its entire lifespan.
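The retention periods stated in this section reduce to simple date arithmetic. The sketch below is illustrative only. The function names are hypothetical, and because the plan does not say from which date the five-year period for duplicate copies is counted, the code assumes it starts at the end of the simulation.

```python
from datetime import date
from typing import Optional


def add_years(d: date, years: int) -> date:
    """Shift d by whole years; 29 February maps to 28 February in non-leap years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def retention_deadline(run_end: date) -> date:
    """Public Control and Experiment data are kept for ten years after the run ends."""
    return add_years(run_end, 10)


def duplicate_removal_date(run_end: date,
                           next_public_release: Optional[date] = None) -> date:
    """Earliest of five years after the run (assumed start point) and the next release."""
    five_years = add_years(run_end, 5)
    if next_public_release is None:
        return five_years
    return min(five_years, next_public_release)


print(retention_deadline(date(2005, 8, 31)))                        # 2015-08-31
print(duplicate_removal_date(date(2005, 8, 31), date(2008, 1, 1)))  # 2008-01-01
```

A working group's decision to keep duplicate copies longer would override the second function; that exception is not modeled.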
5. CCSM Data Format Standards
Standard data and metadata formats are essential for the automated analysis necessary to efficiently interact with large data collections. CCSM uses netCDF as the standard data format for all CCSM data. The use of netCDF makes CCSM output data readily accessible to a variety of existing graphics and analysis packages.
Input data must be in netCDF format.
All CCSM components will either create netCDF or will provide a filter to convert Raw CCSM Output Data files into netCDF.
All Post-processed CCSM data will be made available in netCDF format.
6. CCSM Metadata Requirements
In the broadest sense, metadata are simply "structured data about data," describing important attributes of an information resource. Metadata for CCSM data are carried in the header section of the CCSM model output netCDF files.
All CCSM3 netCDF data will comply with the Climate and Forecast (CF) 1.0 metadata convention.
An automated system will be put in place to assure compliance with the CF-1.0 metadata standard as part of the CCSM testing process.
Translations of CF metadata into other metadata conventions, such as Dublin Core, ISO, and FGDC standards, may be pursued through the Community Data Portal (CDP) and the Earth System Grid (ESG) collaborations.
Develop a method to repeat any CCSM run using only the information contained in the metadata.
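As a rough illustration of what an automated compliance check might look at, the Python sketch below inspects a file header that has already been read into plain dictionaries. It is not a full CF checker. The function name `check_cf_header` and the example variable are hypothetical, and requiring `units` plus either `long_name` or `standard_name` on every variable is a stricter project-level assumption than the CF convention itself, which only recommends the descriptive names.

```python
from typing import Dict, List


def check_cf_header(global_attrs: Dict[str, str],
                    variables: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of problems found in a netCDF header held as dictionaries."""
    problems = []
    # CF-1.0 files identify themselves through the global Conventions attribute.
    if global_attrs.get("Conventions") != "CF-1.0":
        problems.append("global attribute Conventions is not 'CF-1.0'")
    for name, attrs in variables.items():
        # Assumption: every data variable here is a dimensional quantity.
        if "units" not in attrs:
            problems.append(f"{name}: missing units attribute")
        if "long_name" not in attrs and "standard_name" not in attrs:
            problems.append(f"{name}: missing long_name or standard_name")
    return problems


# Hypothetical header contents for illustration.
header_globals = {"Conventions": "CF-1.0"}
header_variables = {
    "TS": {"units": "K", "standard_name": "surface_temperature"},
    "PRECT": {"long_name": "total precipitation rate"},
}
for problem in check_cf_header(header_globals, header_variables):
    print(problem)  # PRECT: missing units attribute
```

Working on dictionaries keeps the sketch independent of any particular netCDF library; a production check would read the attributes directly from the files.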
7. CCSM Case and File Naming Conventions
The CCSM project has adopted Case and File Name conventions to help keep track of the numerous simulations and their output data. The CCSM Software Engineering Group maintains this convention. The CCSM case and file naming conventions are outlined in the Web page at http://www.cesm.ucar.edu/experiments/filename_conventions.html.
CCSM Control and Experiment Simulations will conform to the case and file name conventions. CCSM Evaluation and Testing runs do not have to conform to the case and file name conventions.
8. CCSM Data Quality Assurance Procedures
Primary responsibility for quality control of CCSM data products lies with the PI overseeing the model integration. Currently, quality control of CCSM data is carried out by individual component working groups or by collaborations of several component working groups.
PIs are responsible for maintaining the quality and correctness of their CCSM data. The PI should address questions raised by the researchers using these data as quickly as possible.
Add Quality Control assertions, specifying such values as absolute maxima and absolute minima, to model output data fields as required by the CF convention.
Add lightweight diagnostic output streams for each component model. Document these values from CCSM Control and Experiment Simulations online.
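One way to picture the quality control assertions mentioned above is a range check against the `valid_min` and `valid_max` variable attributes that the CF convention defines. The sketch below is a minimal example on plain Python lists; the function name `out_of_range` and the sample limits are hypothetical, not values prescribed by CCSM.

```python
from typing import Dict, Iterable, List, Tuple


def out_of_range(values: Iterable[float],
                 attrs: Dict[str, float]) -> List[Tuple[int, float]]:
    """Return (index, value) pairs that fall outside valid_min..valid_max.

    A missing limit is treated as unbounded. NaN values are reported as well,
    because every comparison with NaN is false.
    """
    lower = attrs.get("valid_min", float("-inf"))
    upper = attrs.get("valid_max", float("inf"))
    return [(i, v) for i, v in enumerate(values) if not (lower <= v <= upper)]


# Hypothetical surface temperatures in kelvin with illustrative limits.
surface_temperature = [250.0, 310.5, 999.0]
limits = {"valid_min": 150.0, "valid_max": 350.0}
print(out_of_range(surface_temperature, limits))  # [(2, 999.0)]
```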
9. Access to CCSM Data
Web technologies allow for the efficient discovery and access of CCSM data. The CCSM working groups have been very active in establishing Web portals to CCSM data subsets, both within NCAR and through DOE's collaborations. Currently, the primary distribution system for CCSM data is the Earth System Grid (ESG). Other online CCSM data distribution systems include the CDP and the GIS portal.
To maximize the ease-of-access and value of the data to the scientific community, all CCSM Public data shall be made available via the ESG (http://www.earthsystemgrid.org/). The registration process through ESG will permit the assembly of information on the users and use of the CCSM Public data. CCSM Public data may also be made available through the CDP if there is sufficient demand and support for this access.
NCAR will serve as much CCSM data online as possible. Other centers archiving and serving CCSM data are encouraged to do so as well. All sites are expected to coordinate their data services.
Restart and initial data will not be made publicly available. They can be obtained upon request from the working group liaison.
Develop a process by which working groups can acquire and apply resources allowing their working group members to remotely access working group data housed in the archival storage devices at the various sites.
10. Registering and Auditing the Characteristics of CCSM Data Users
To measure CCSM's contribution to the UCAR scientific community, the CCSM project will collect registration information from users downloading Public CCSM data. This information will describe the user's name, contact information, institutional affiliation, and intended use of the data. Summaries of this CCSM registration information will be reported to the CCSM SSC and the CCSM planning agencies to demonstrate how the CCSM project is serving the scientific community.
The CCSM data user community will be surveyed to see how CCSM data distribution and services can best meet the needs of the community.
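The registration information described in this section maps naturally onto a small record type. The sketch below shows one hypothetical layout using the four fields named in the text, plus a summary by institution of the kind that could be reported to the SSC. The class, function, and sample entries are illustrative assumptions, not the ESG registration schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Registration:
    """One user registration, with the fields listed in this section."""
    name: str
    contact: str
    affiliation: str
    intended_use: str


def summarize_by_affiliation(records: Iterable[Registration]) -> Dict[str, int]:
    """Count registrations per institutional affiliation."""
    return dict(Counter(record.affiliation for record in records))


# Hypothetical registrations for illustration.
registrations = [
    Registration("A. Researcher", "a@example.org", "Example University", "regional downscaling"),
    Registration("B. Analyst", "b@example.org", "Example University", "model intercomparison"),
    Registration("C. Modeler", "c@example.org", "Example Laboratory", "forcing data"),
]
print(summarize_by_affiliation(registrations))
# {'Example University': 2, 'Example Laboratory': 1}
```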
11. Distributed CCSM Data Repositories
CCSM Control and Experiment integrations are being carried out on a continuous basis at a number of computing centers around the world. Due to the large volume of data that is generated, no one center can support all of these CCSM data. It will be necessary to coordinate CCSM data storage, discovery, and access policies among the various sites where CCSM data will be archived. This is particularly important for Public data.
Data produced by the CCSM will be stored, managed, and distributed by the data archive center appointed by the entity sponsoring the CCSM run that produces the data. CCSM data created at NCAR under NSF support will be archived on the NCAR Mass Storage System (MSS). CCSM data generated at non-NCAR facilities should be archived at either the site of generation or its associated data archive center. CCSM data created at non-NCAR sites may be archived on the NCAR MSS if prior arrangements have been made with both CCSM and NCAR's Scientific Computing Division's management.
12. Distribution of CCSM Data Products to Non-NSF Entities
Data will be made available for users who are not CCSM collaborators at the marginal cost of making and shipping the copies (not the full cost of data production and archive maintenance). However, for large data orders, we reserve the right to make special policies and perhaps ask for a data exchange that is beneficial to both sides.
13. Changes to the CCSM Data Management Plan
Recognizing that ours is an evolving field, the CCSM SSC reserves the right to change the CCSM Data Policy. Any changes will be made with respect for the resource needs of PIs with regard to the processing and distribution of information. When changes in data policy would require substantial increases in equipment, supplies, or personnel, current investigations will not be expected to comply with the changes.
APPENDIX A. CCSM Data
During the course of an integration, the CCSM produces three distinct output data streams: printed log information, restart, and history data. After a CCSM run finishes, the raw history data are post-processed into more useful collections referred to as post-processed history data.
a. Input Initial/Boundary Data: netCDF, raw binary, or ASCII
b. Output Printed Data: plain text files
c. Output Restart Data: raw binary
d. Output Raw History Data: netCDF compliant with the CF convention
e. Post-processed History Data: netCDF/CF, JPG images, HTML pages
Types of CCSM Output Data
a. Input Initial and Boundary Condition Data
CCSM runs are typically started using initial data that represent a known or idealized climate state for each CCSM component. Boundary condition files may also be used to prescribe time varying values of variables that are not predicted, such as the annual cycle of ozone in the atmosphere or emission profiles for future climate change scenarios.
b. Output Printed Data
The printed output contains diagnostic messages written by the various CCSM components during the course of a run. This includes a printed log file for the entire system, as well as printed log files from each of the CCSM components. The printed output's primary importance is for archiving details about the model run, how long it ran, and when it stopped and restarted. While the printed output contains little information useful for detailed model diagnostics, it provides a convenient method for displaying "quick look" diagnostics.
c. Output Restart Data
The CCSM restart data are raw binary files containing sufficient information for the CCSM to restart exactly. Restart data are usually output at monthly, half year, or yearly intervals. As the integration progresses, most old restart data are deleted to save disk space. The accepted practice is to retain restart data at decadal intervals.
d. Output Raw History Data
Raw CCSM Output Data are the original, high-volume data streams directly created by each CCSM component (at present these are atmosphere, ocean, land, sea ice, and flux coupler) during the course of a CCSM integration. The raw history data contain the model data from each component of the CCSM. The history data consist of grid point representations of three-dimensional (latitude, longitude, time) and four-dimensional (latitude, longitude, height/depth, time) model fields. These fields include such variables as surface temperature, precipitation, and ocean salinity. Output frequencies can range from minutes to months or years, and the data can represent, for instance, instantaneous values, extreme values, or average values over the output period. In total, several hundred fields are output by the CCSM components. Due to model complexity and the compromises involved with attaining peak model performance, the Raw CCSM Output Data and CCSM restart files may not conform to CCSM netCDF data format standards.
e. Post-Processed History Data
Post-processed CCSM Data are all other CCSM data products. The CCSM is a collection of distinct component models optimized for very high-speed multi-processor computing. This results in raw output data streams from each component that do not present the data in the most coordinated or user-friendly manner. While raw history data can be analyzed, the raw data packages do not allow for easy time series analysis. For example, the atmosphere component puts all the requested variables into one large file at each requested output period. While this allows for very fast model execution, it makes it impossible to analyze time series of individual variables without having to access the entire data volume. The process of transforming the raw CCSM history output into data collections more useful for analysis is called post-processing. This step may involve reformatting the data, deriving new fields from the existing data, making averages along any or all of the data dimensions, or sampling the data in different ways. These post-processed data are usually condensed into lower volume collections that are more portable and easier to use. The Post-processed CCSM Data are the most useful data for climate analysis researchers and represent the desired results of CCSM experiments. Typically these include Raw CCSM Output Data reformatted into different time or variable packages or completely new fields derived from Raw CCSM Output Data.
This post-processing will be fully automated for all components.
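The core reformatting step described above, turning one file per output period with all variables into one time series per variable, can be sketched in a few lines of Python. This is a toy illustration only: each "history file" is stood in for by a tuple of a time label and a dictionary of scalar values, whereas real history files hold gridded netCDF fields. The names `HistoryFile` and `to_time_series` and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Stand-in for one raw history file: (time label, {variable name: value}).
HistoryFile = Tuple[str, Dict[str, float]]


def to_time_series(history_files: List[HistoryFile]) -> Dict[str, List[Tuple[str, float]]]:
    """Regroup per-period records into one chronological series per variable."""
    series = defaultdict(list)
    # Assumption: time labels sort chronologically as strings (e.g. "0001-02").
    for time_label, fields in sorted(history_files, key=lambda item: item[0]):
        for name, value in fields.items():
            series[name].append((time_label, value))
    return dict(series)


# Hypothetical monthly records, deliberately out of order.
raw = [
    ("0001-02", {"TS": 287.9, "PRECT": 2.8}),
    ("0001-01", {"TS": 287.4, "PRECT": 3.1}),
]
print(to_time_series(raw)["TS"])  # [('0001-01', 287.4), ('0001-02', 287.9)]
```

After this regrouping, analyzing a single variable no longer requires reading every other variable in the data volume, which is the benefit the text attributes to post-processing.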
APPENDIX B. CCSM Data Tools
The CCSM project uses netCDF as its data format and benefits from the large suite of software tools that support this format. Unidata (http://www.unidata.ucar.edu) has an extensive listing of software that can manipulate netCDF data. Tools that have been found to be particularly useful for analysis and visualization of CCSM data are:
Public domain tools:
NCL: NCAR Command Language
VCDAT: DOE analysis and visualization tools
ncview: A simple netCDF display tool
GrADS: A visualization and analysis tool
FERRET: A netCDF visualization tool
IDL, MatLab, etc.
APPENDIX C. The CCSM Data User Community
Users of CCSM data span a wide range of interests. An incomplete list includes:
Scientists at universities, federal laboratories, and NCAR
CCSM working groups performing production integrations
CCSM working groups performing development integrations
National Assessment program
Intergovernmental Panel on Climate Change (IPCC) Data Distribution Center
Coupled Model Intercomparison Project/Paleoclimate Model Intercomparison Project (CMIP/PMIP)
Other modeling groups using CCSM data as forcing input to their models (e.g., regional climate models)
The broad common needs of these users are ready access to the data, diagnostics of CCSM performance (scientifically and computationally), and various types of analysis and analysis tools.