This section provides a brief overview of CCSM performance issues.
As is discussed in section 5.2, compiler flags are set in the $CCSMROOT/models/bld/Macros.* files. The default flags were chosen carefully in order to produce exact restart and valid climate results for various combinations of processors, parallelization, and components. They were also chosen to balance performance, error-trapping characteristics, and confidence in the results, and many tradeoffs were made among these goals. Users are free to modify the compiler flags in the Macros files. However, CCSM strongly suggests that users carry out a suite of exact restart tests, parallelization, performance, and science validations prior to using a modified Macros file for science.
Patches will be provided as machines and compilers evolve. There is no guarantee about either the scientific or software validity of the compiler flags on machines outside CCSM's control. The community should be aware that machines and compilers will evolve, change, and occasionally break. Users should validate any CCSM run on a platform before documenting the resulting science (see section 7).
Model parallelization types and accompanying capabilities critical for optimizing the CCSM load balance are summarized in the following table. For each component, a summary is provided of the component's parallelization options and whether answers are bit-for-bit when running that component with different processor counts.
The following limitations apply to component parallelization:
Some architectures and compilers do not support OpenMP threads. Consequently, OpenMP threading may be unavailable on a given architecture even when a component supports its use: both the architecture and the compiler must support threading before a component can exploit it.
CCSM load balance refers to the allocation of processors to the different components such that resources are used efficiently for a given model case and the resulting throughput is, in some sense, optimized. Because of the constraints on how processors can be allocated efficiently to different components, there are usually only a handful of ``sweet spots'' in processor usage for any given component set, resolution, and machine.
CCSM components run independently and are tied together only through MPI communication with the coupler. For example, data sent by the atm component to the land component is sent first to the coupler component which then sends the appropriate data to each land component process. The coupler component communicates with the atm, land, and ice components once per hour and with the ocean only once a day. The overall coupler calling sequence currently looks like
   Coupler
   -------
   do i=1,ndays                      ! days to run
      do j=1,24                      ! hours
         if (j.eq.1) call ocn_send()
         call lnd_send()
         call ice_send()
         call ice_recv()
         call lnd_recv()
         call atm_send()
         if (j.eq.24) call ocn_recv()
         call atm_recv()
      enddo
   enddo
For scientific reasons, the coupler receives hourly data from the land and ice models before receiving hourly data from the atmosphere. Because of this execution sequence, it is important to allocate processors in a way that assures that atm processing is not held up waiting for land or ice data. It is easy to naively allocate processors to components in such a way that unnecessary time is spent blocking on communication and idle processors result.
While the coupler is largely responsible for inter-component communication, it also carries out some computations such as flux calculations and grid interpolations. These are not indicated in the above pseudo-code.
Since all MPI ``sends'' and ``receives'' are blocking, a component may wait during the send and/or receive communication phases. Between communication phases, each component carries out its internal computations. In general, a component's time loop looks like:
   General Physical Component
   --------------------------
   do i=1,ndays
      do j=1,24
         call compute_stuff_1()
         call cpl_recv()
         call compute_stuff_2()
         call cpl_send()
         call compute_stuff_3()
      enddo
   enddo
So, compute_stuff_1 and compute_stuff_3 are carried out between the send and the following receive, and compute_stuff_2 is carried out between the receive and the send. Note that for each ocean communication, there are 24 ice, land, and atm communications. Aggregated over a day, however, the communication pattern can be represented schematically as below and serves as a template for load balancing CCSM3.
   [Schematic: aggregated daily communication pattern along a time axis.
    Each component row (ocn, ice, lnd, atm) shows its send and receive
    points relative to the coupler (cpl) row, with arrows marking the
    message exchanges.  s = send, r = recv]
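The blocking exchange described above can be sketched numerically. The toy Python model below (the function name day_idle and all per-hour compute costs are hypothetical illustrative values, not CCSM measurements; the once-a-day ocean exchange is omitted for simplicity) steps through one day of hourly rendezvous with the coupler and tallies how long each side blocks:

```python
# Toy model of one coupled day: each hour the physical components
# compute, then meet the coupler one at a time (lnd and ice before atm,
# matching the coupler sequence).  A blocking exchange completes only
# when BOTH sides are ready, so the earlier side sits idle until then.
# All per-hour compute costs are made-up illustrative numbers (seconds).
compute = {"atm": 9.0, "lnd": 2.0, "ice": 3.0, "cpl": 1.0}

def day_idle(compute):
    clock = {name: 0.0 for name in compute}   # local elapsed time
    idle = {name: 0.0 for name in compute}    # accumulated blocked time
    for hour in range(24):
        for name in ("atm", "lnd", "ice"):
            clock[name] += compute[name]
        clock["cpl"] += compute["cpl"]
        for name in ("lnd", "ice", "atm"):     # rendezvous order
            meet = max(clock[name], clock["cpl"])
            idle[name] += meet - clock[name]   # component waits for cpl
            idle["cpl"] += meet - clock["cpl"] # cpl waits for component
            clock[name] = clock["cpl"] = meet
    return idle
```

With these numbers the slowest component (atm) never waits while everyone else accumulates idle time blocking on it, which is exactly the imbalance that motivates giving atm the most processors.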
CCSM3 runtime statistics can be found in the coupler, csim, and pop log files, whereas cam and clm create files of the form timing.<processID> containing timing statistics. As an example, near the end of the coupler log file, the line

   (shr_timer_print) timer 2: 1 calls, 355.220s, id: t00 - main integration

indicates that 355 seconds were spent in the main time integration loop. This time is also referred to as the ``stepon'' time. Simply put, load balancing involves reassigning processors to components so as to minimize this statistic for a given total number of processors. Due to the CCSM processing sequence, it is impossible to keep all processors 100% busy. Generally, a well balanced configuration will show that the atm and ocean processors are well utilized, whereas the ice and land processors may show considerable idle time. It is more important to keep the atm and ocean processors busy because the number of processors assigned to atm and ocean is much larger than the number assigned to ice and land.
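The stepon time can be extracted from a coupler log mechanically. The sketch below parses lines in the shr_timer_print format; the pattern is an assumption based only on the single example line above, so it may need adjusting for other CCSM versions:

```python
import re

# Matches lines like:
#   (shr_timer_print) timer 2: 1 calls, 355.220s, id: t00 - main integration
# The layout is inferred from the example above, not from a format spec.
TIMER_RE = re.compile(
    r"\(shr_timer_print\) timer\s+\d+:\s+(\d+) calls,\s+([\d.]+)s,"
    r"\s+id:\s+\S+\s+-\s+(.+)"
)

def stepon_seconds(log_lines):
    """Return the 'main integration' timer value in seconds, or None."""
    for line in log_lines:
        m = TIMER_RE.search(line)
        if m and m.group(3).strip() == "main integration":
            return float(m.group(2))
    return None

line = "(shr_timer_print) timer 2: 1 calls, 355.220s, id: t00 - main integration"
```

Feeding the example line through stepon_seconds returns 355.22, the statistic to minimize when rebalancing processors.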
The script getTiming.csh, in the directory $CCSMROOT/scripts/ccsm_utils/Tools/getTiming, can be used to aid in the collection of run time statistics needed to examine the load balance efficiency.
The following examples illustrate some issues involved in load balancing a CCSM3 run for a T42_gx1v3 run on bluesky.
   Case                     LB1      LB2
   ====================================
   OCN cpus                  40       48
   ATM cpus                  32       40
   ICE cpus                  16       20
   LND cpus                   8       12
   CPL cpus                   8        8
   total CPUs               104      128
   stepon (sec)             336      280
   node seconds           34944    35840
   simulated yrs/day       7.05     8.45
   simulated yrs/day/cpu   .067     .066
In the example above, adding more processors in the correct balance resulted in a configuration that was faster (computed more simulated years per wall-clock day) and essentially just as processor efficient (simulated years per day per cpu). The example below shows that assigning more processors to a given run may speed up that run (more simulated years per day) but be less processor efficient.
   Case                     LB3      LB4
   ====================================
   OCN cpus                  32       48
   ATM cpus                  16       40
   ICE cpus                   8       20
   LND cpus                   4       12
   CPL cpus                   4        8
   total CPUs                64      128
   stepon (sec)             471      280
   node seconds           30144    35840
   simulated yrs/day       5.03     8.45
   simulated yrs/day/cpu   .078     .066
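The derived rows in the tables above follow directly from the stepon time and processor count. The sketch below reproduces them; note that the run length (10 simulated days) is not stated explicitly in the tables and is inferred here from the reported stepon times and throughput figures:

```python
# Reproduce the load-balance metrics from the tables above.  The run
# length of 10 simulated days is an inference from the published
# numbers (e.g. 336 s stepon <-> 7.05 simulated yrs/day), not a value
# stated in the original text.
SIM_DAYS = 10.0
SECONDS_PER_WALL_DAY = 86400.0

def metrics(stepon_seconds, total_cpus):
    yrs_per_day = (SIM_DAYS / 365.0) * SECONDS_PER_WALL_DAY / stepon_seconds
    return {
        "node seconds": stepon_seconds * total_cpus,
        "simulated yrs/day": yrs_per_day,
        "yrs/day/cpu": yrs_per_day / total_cpus,
    }

lb1 = metrics(336.0, 104)   # case LB1
lb3 = metrics(471.0, 64)    # case LB3
lb4 = metrics(280.0, 128)   # case LB4 (same numbers as LB2)
```

Comparing lb3 and lb4 shows the tradeoff in the text: lb4 simulates more years per day, but lb3 gets more years per day out of each cpu.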
Learning how to analyze run time statistics and properly assign processors to components takes considerable time and is beyond the scope of this document.