

9 Troubleshooting (Log Files, Aborts and Errors)

This section describes some common aborts and errors, how to find them, and how to fix them. Model failures occur for many reasons, not the least of which is that the science is ``going bad''. Users are encouraged to monitor the results of their runs continually so that they can preempt failures. General suggestions come first, followed by a few of the most frequent errors and aborts.

9.1 Error messages and log files

CCSM3 does not always handle log and error messages optimally, because handling messages gracefully in a multiple-executable model poses numerous challenges. We hope to fully address this problem in the future. For now, there are several items to be aware of:

9.2 General Issues

There are a number of general issues users should be constantly aware of that can cause problems.

9.3 Model time coordination

An error message like ``ERROR: models uncoordinated in time'' is generated by the coupler and indicates that the individual component start dates do not agree. This occurs when an inconsistent set of restart files is being used. Users should confirm that the restart files used are consistent, complete, and not corrupted. Users who are trying to execute a branch run should review the constraints associated with a branch run, and should consider a hybrid run instead if they intend to use inconsistent restart files.

9.4 POP ocean model non-convergence

An error message like XX indicates the POP model has failed to converge. This typically occurs when the CFL condition is violated in association with some unusually strong forcing in the model. See the POP model user's guide for more information.

The solution to this problem is often to reduce the ocean model timestep. The POP timestep is set in the Buildnml_prestage/pop.buildnml_prestage.csh file, where the value associated with DT_COUNT in the sed command should be increased. DT_COUNT is the number of timesteps per day taken in POP, so a higher DT_COUNT means a shorter timestep. Initially, users should increase DT_COUNT by 10 to 20 percent and then restart their run. This will slow the POP model down by an equivalent amount. The results will not be bit-for-bit identical but should be climate continuous. Users can reset the timestep as needed to trade off model stability against performance.
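
The arithmetic behind this adjustment can be sketched as follows. This is a minimal illustration, not part of the CCSM3 scripts: the function names and the starting DT_COUNT of 24 are hypothetical, and the timestep is taken to be simply 86400 seconds divided by DT_COUNT, as the definition above implies. The increase is rounded up so that the new DT_COUNT is a whole number of steps per day.

```python
SECONDS_PER_DAY = 86400


def scaled_dt_count(dt_count: int, percent: int) -> int:
    """Return dt_count increased by `percent` percent, rounded up.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return -(-dt_count * (100 + percent) // 100)


def timestep_seconds(dt_count: int) -> float:
    """Nominal timestep length implied by a number of steps per day."""
    return SECONDS_PER_DAY / dt_count


# Hypothetical starting value; use the DT_COUNT from your own case.
current = 24
for percent in (10, 20):
    new = scaled_dt_count(current, percent)
    print(f"+{percent}%: DT_COUNT {current} -> {new}, "
          f"timestep {timestep_seconds(current):.0f} s -> "
          f"{timestep_seconds(new):.0f} s")
```

The new DT_COUNT value still has to be entered by hand in the sed command in pop.buildnml_prestage.csh; the sketch only helps choose the number.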

9.5 CSIM model failures

An error message like XX is from the ice model and usually indicates a serious model problem. These errors normally arise from bad forcing or coupling data in CSIM. Users should review their setup and model results and search for scientific problems in CSIM or other models.

9.6 CAM Courant limit warning messages

A warning message like

COURLIM: *** Courant limit exceeded at k,lat= 1 62 (estimate = 1.016),

in the CAM log file is typical. This means CAM is approaching a CFL limit and is invoking an additional internal truncation to minimize the impact. These warning messages can be ignored unless a model failure has occurred.
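
When deciding whether these warnings are becoming more frequent, it can help to pull them out of the log programmatically. The following is a minimal sketch that assumes the message format shown above; the function name and the regular expression are illustrative and are not part of CAM or the CCSM3 scripts.

```python
import re

# Matches the COURLIM warning line as it appears in the example above.
COURLIM_RE = re.compile(
    r"COURLIM: \*\*\* Courant limit exceeded at k,lat=\s*(\d+)\s+(\d+)\s*"
    r"\(estimate =\s*([0-9.]+)\)"
)


def find_courlim(lines):
    """Return (k, lat, estimate) for each COURLIM warning in `lines`."""
    hits = []
    for line in lines:
        match = COURLIM_RE.search(line)
        if match:
            k, lat, estimate = match.groups()
            hits.append((int(k), int(lat), float(estimate)))
    return hits


# Sample lines copied from the message format documented in this section.
sample = [
    "COURLIM: *** Courant limit exceeded at k,lat= 1 62 (estimate = 1.016),",
    "solution has been truncated to wavenumber 41 ***",
]
print(find_courlim(sample))  # prints [(1, 62, 1.016)]
```

To scan a real log, pass an open file object instead of the sample list; the path to the CAM log depends on the case setup.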

9.7 CAM model stops due to non-convergence

The CAM model has been observed to halt when high upper-level winds exceed the Courant limit associated with the default resolution and timestep. The atmosphere log file will contain messages similar to:

NSTEP = 7203774 8.938130967099877E-05 7.172242117666898E-06 251.2209.84438
COURLIM: *** Courant limit exceeded at k,lat= 1 62 (estimate = 1.016)
solution has been truncated to wavenumber 41 ***
*** Original Courant limit exceeded at k,lat= 1 62 (estimate = 1.016) ***

If changes have been introduced to the standard model, this abort may be caused by a timestep that is too large for those changes. The historical solution to this problem is to increase the horizontal diffusivity or the kmxhdc parameter in CAM for a few months of the run and then reset it to the original value. However, this solution affects the climate results. If changes such as these are undesirable, please contact the CAM scientific liaison at NCAR for further assistance.

To change the horizontal diffusivity, add the CAM namelist variable HDIF4 to the CAM namelist in Buildnml_prestage/cam.buildnml_prestage.csh and set it to a value that is larger than the default value for the current case. The default value can be found in the CAM log file.

Finally, if the failure occurs in the first couple of days of a startup run, the CAM namelist option DIVDAMPN can be used to temporarily increase the divergence damping (a typical value to try would be 2.).

9.8 T31 instability onset

At T31 resolution (e.g. T31_gx3v5), strong surface forcing can lead to the onset of instabilities that result in model failure. This has been known to occur in several paleoclimate runs. The failure has been linked to the interaction between the default model timestep in CAM and CLM (1800 seconds) and the default coupling interval between CAM and CLM (1 hour). One solution has been to decrease the CAM and CLM model timesteps to 1200 seconds. The user also needs to reset the RTM averaging interval, rtm_nsteps, in the CLM namelist from 6 to 9 to be consistent with the current namelist defaults.
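
The numbers above can be checked with a short calculation. This sketch assumes that rtm_nsteps counts CLM timesteps, so that the RTM averaging interval in seconds is the timestep multiplied by rtm_nsteps; that reading is an inference from the values quoted here rather than a statement from the CLM documentation, and the function names are illustrative only.

```python
COUPLING_INTERVAL = 3600  # default CAM/CLM coupling interval, in seconds


def rtm_interval_seconds(dtime: int, rtm_nsteps: int) -> int:
    """RTM averaging interval, assuming rtm_nsteps counts model timesteps."""
    return dtime * rtm_nsteps


def steps_per_coupling(dtime: int) -> int:
    """Whole model timesteps that fit in one coupling interval."""
    if COUPLING_INTERVAL % dtime != 0:
        raise ValueError("timestep does not divide the coupling interval")
    return COUPLING_INTERVAL // dtime


# (timestep in seconds, rtm_nsteps) for the default and the reduced setup.
for dtime, rtm_nsteps in ((1800, 6), (1200, 9)):
    print(f"dtime={dtime} s: {steps_per_coupling(dtime)} steps per coupling "
          f"interval, RTM interval {rtm_interval_seconds(dtime, rtm_nsteps)} s")
```

Under that assumption both configurations give the same RTM averaging interval of 10800 seconds (3 hours), which is what changing rtm_nsteps from 6 to 9 preserves when the timestep drops from 1800 to 1200 seconds.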
