

PaleoCSM Frequently Asked Questions



1. I have my scripts set up as specified in the Tutorial, but the CRAY doesn't seem to be finding my home directory. Why?

Your home directory should be either on the CRAY itself, or on a machine which is cross-mounted to it.

If your setup scripts reside in a directory which is only cross-mounted to the CRAY, then you will have to create a symbolic link from that directory to your $HOME on the CRAY. For example, if your files are on middlepark in /fs/univ, then you will need to create a symbolic link from /fs/univ/{username}/csm/ to /home/antero/{username}/csm. This way, the CRAY will recognize either as your $HOME.
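A symbolic link of this kind is normally made with ln -s. The following is a minimal sketch, not a required procedure: the helper name link_csm_dir is hypothetical, {username} stays a placeholder, and it assumes the scripts live in the cross-mounted directory while the link is the path that should appear under $HOME on the CRAY.

```shell
# Hypothetical helper: link_csm_dir TARGET LINK
# TARGET is the cross-mounted directory that holds the scripts (assumption).
# LINK is the path under $HOME on the CRAY that should resolve to TARGET.
link_csm_dir() {
    target=$1
    link=$2
    # The directory being pointed at must already exist.
    if [ ! -d "$target" ]; then
        echo "target directory not found: $target" >&2
        return 1
    fi
    # Do not clobber a file, directory, or link that is already there.
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "refusing to overwrite existing path: $link" >&2
        return 1
    fi
    ln -s "$target" "$link"
}
```

With the paths from the example above, the call would look like link_csm_dir /fs/univ/{username}/csm /home/antero/{username}/csm, with your own user name substituted.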

2. I'm running on the CRAYs, and the model keeps crashing with "not enough space"...

Check your inodes. On the CRAY, type quota -v to find out how much temporary space you have. You may have filled up your inode allocation. The CSM takes up a large amount of /tmp space when running, so you may need to request more /tmp space. This is especially true if you are running more than one simulation at the same time.
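Every file and directory occupies one inode, so counting entries under a directory is one way to see where an inode allocation is going. This is only a sketch: the helper name count_entries is hypothetical, and it relies on the standard find and wc utilities rather than on anything specific to the CRAY quota system.

```shell
# Hypothetical helper: count the files and directories under DIR,
# including DIR itself. Each one occupies an inode, so a large count
# points at a directory worth cleaning up.
count_entries() {
    find "$1" | wc -l | tr -d ' '
}
```

Running count_entries on each run directory in turn shows which simulation is holding the most files.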

3. I'm running the model on the J90. The model gets stuck and crashes with no error message...this doesn't happen on the C90...what could possibly be wrong?

Sometimes errors are hard to track down. It is a good general rule to make sure your quotas and UDB limits on all your CRAY accounts are large enough to run the CSM. Type crayinfo -u {user} to see these statistics. UDBSEE also gives your UDB limits. Your "Processes" (JPROCLIM) limit may not be large enough. Your memory limits may also need to be increased.

Note that limits are set differently for batch jobs and interactive jobs. Generally, limits for interactive jobs are set low. If you are debugging your job and running it interactively, you may run into this problem. Try submitting your job to the batch queue instead.

4. My error message says "no more processes"...?

See answer to number 3.

5. I keep getting a "tfork" error...what does this mean?

Usually, tfork errors have to do with not specifying the correct number of processors per model. This is set with the NCPUS variable in the .nqs script. Note that the correct NCPUS value for each model is machine dependent.

6. What does "broken pipe" mean?

When one model crashes, it can't communicate with the rest of the CSM. You will get this message when one component of the model can't communicate with another component.

7. The model seemed to be running ok, then produced a "timeout" message...what happened?

There is a maximum amount of time one model component will wait for communication from another model component. This is set with the "-t" option in the executable line(s) in the .nqs script. Here is a sample executable line:

env NCPUS=16 atm -l 1 -t 600 < atm.parm >> & ! atm.log &

When the "-t" limit is reached and communication has not occurred, that model component will terminate and send this message.

8. I tried restarting the model after the CRAY crashed, but the coupler failed with the line "models uncoordinated in time"... How do I fix this?

In your $HOME/csm-rpointers directory, you will have a restart file for each component of the model. Sometimes, when the machine crashes unexpectedly, not all of the restart files have been updated. You need to go through each component's restart file and make sure they all contain the same date.
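To compare the dates, it can help to print all of the restart files together. The following is a minimal sketch: the helper name show_rpointers is hypothetical, and because the layout of the files is not described here, it simply prints each file's contents under its name so the dates can be compared by eye.

```shell
# Hypothetical helper: print every regular file in the restart-pointer
# directory, each preceded by its name, so the dates can be compared.
# With no argument it looks in $HOME/csm-rpointers.
show_rpointers() {
    dir=${1:-$HOME/csm-rpointers}
    for f in "$dir"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty directory.
        [ -f "$f" ] || continue
        echo "== $f"
        cat "$f"
    done
}
```

Any component whose file shows a different date from the others is the one that was not updated before the crash.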


For any further questions, email shields@ucar.edu.



This page is still under construction.

This page is maintained by Christine Shields (shields@ncar.ucar.edu).