1. I have my scripts set up as specified in the Tutorial, but
the CRAY doesn't seem to be finding my home directory. Why?
Your home directory should be either on the CRAY itself or on a
machine that is cross-mounted to it.
If your setup scripts reside in a directory that is only cross-mounted
to the CRAY, then
you will have to create a symbolic link from that directory to
your $HOME on the CRAY. For example, if your files are on middlepark
in /fs/univ , then you will need to create a symbolic link
from /fs/univ/{username}/csm/ to /home/antero/{username}/csm .
This way, the CRAY
will recognize either path as your $HOME .
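As a rough illustration of the link described above, here is a minimal sketch in plain sh. It assumes the link is created under your CRAY home directory and points at the cross-mounted directory; the function name link_csm_dir is made up for this example, and {username} is a placeholder you must replace.

```shell
# link_csm_dir TARGET LINK
# Create LINK as a symbolic link pointing at TARGET.
# Refuses to touch LINK if something already exists there.
link_csm_dir() {
    target=$1
    link=$2
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "refusing to overwrite existing $link" >&2
        return 1
    fi
    ln -s "$target" "$link"
}

# Example call with the paths from the answer above
# ({username} is a placeholder, not a literal path component):
#   link_csm_dir /fs/univ/{username}/csm /home/antero/{username}/csm
#   ls -l /home/antero/{username}/csm
```

Running ls -l on the link afterwards should show it pointing at the cross-mounted directory.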
2. I'm running on the CRAYs, and the model keeps crashing with "not enough space"...
Check your inodes. On the CRAY, type quota -v to find out
how much temporary space you have. You may have filled up your inode
allocation. The CSM takes up a large amount of /tmp space when running, so
you may need to request more /tmp space. This is especially true if
you are running more than one simulation at the same time.
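If quota -v shows that you are near your inode limit, one rough way to see where the inodes are going is to count the entries under a directory, since each file, directory, or link uses roughly one. The sketch below is plain sh and is not part of the CSM scripts; the function name count_entries is made up for this example.

```shell
# count_entries DIR
# Print the number of entries (files, directories, links) found under DIR,
# including DIR itself. This is only an approximation of inode usage:
# hard links share an inode, and odd file names can skew the count.
count_entries() {
    find "$1" -print | wc -l | tr -d ' '
}

# Example (the directory is whatever you want to inspect):
#   count_entries "$HOME"
```

Comparing the counts for a few directories usually shows quickly which one holds most of the small files.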
3. I'm running the model on the J90. The model gets stuck and crashes with no error message...this doesn't happen on the C90...what could
possibly be wrong?
Sometimes errors are hard to track down. As a general rule, make
sure the quotas and UDB limits on all your CRAY accounts are large enough
to run the CSM.
Type crayinfo -u {user} to see these statistics. UDBSEE
also gives your UDB limits. Your "Processes" limit (JPROCLIM) may not
be large enough. Your memory limits may also need to be increased.
Note that limits are set differently for batch jobs vs. interactive jobs. Generally,
limits for interactive jobs are set low. If you are debugging your job
and running it interactively, you may run into this problem. Try
submitting your job to the batch queue instead.
4. My error message says "no more processes"...?
See the answer to question 3.
5. I keep getting a "tfork" error...what does this mean?
Usually, tfork errors mean that the number of processors per model
has not been specified correctly. This is set with the NCPUS variable in the .nqs script. Note that
the appropriate NCPUS value for each model is machine dependent.
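A quick way to review the NCPUS settings is to list every line of the script that mentions NCPUS. This is only a sketch in plain sh; the function name show_ncpus and the script name run.nqs are made up for the example.

```shell
# show_ncpus SCRIPT
# Print every line of SCRIPT that mentions NCPUS, prefixed with its line number,
# so the setting for each model component can be checked in one place.
show_ncpus() {
    grep -n 'NCPUS' "$1"
}

# Example (run.nqs is a hypothetical script name):
#   show_ncpus run.nqs
```

Each model's executable line should appear in the output with the NCPUS value intended for the machine you are running on.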
6. What does "broken pipe" mean?
When one model crashes, it can't communicate with the rest of the CSM.
You will get this message when one component of the model can't
communicate with another component.
7. The model seemed to be running ok, then produced a "timeout" message...what happened?
There is a maximum amount of time one model component will wait for communication from
another model component. This is set with the "-t" option in the executable line(s) in the .nqs script. Here is a sample executable line:
env NCPUS=16 atm -l 1 -t 600 < atm.parm >> & ! atm.log &
When the "-t" limit is reached and communication has not occurred, that model component will terminate and send the "timeout" message.
8. I tried restarting the model after the CRAY crashed, but the coupler
failed with the line "models uncoordinated in time"... how do I fix this?
In your $HOME/csm-rpointers directory, you will have a restart file for each
component of the model. Sometimes, when the machine crashes unexpectedly, not
all restart files have been updated. You need to go through each component's
restart file and make sure they all contain the same date.
This page is maintained by Christine Shields
( shields@ncar.ucar.edu )
PaleoCSM Frequently Asked Questions
For any further questions, email shields@ucar.edu .
This page is still under construction.