The testpio directory, included with the release package, tests both the accuracy and the performance of reading and writing data with the PIO library.
The testpio directory contains three Perl scripts that you can use to build and run the testpio.F90 code.
Additional C shell script wrappers are packaged with the testpio suite to allow environment customization of these three Perl scripts. The following help information describes in more detail how the testpio code works.
The tests are controlled via a namelist. Sample namelist files are located in the testpio/namelists directory. Each file contains a set of general namelists and specific namelists that set up a computational decomposition and an IO decomposition. The computational decomposition should be set up to duplicate a realistic model data decomposition. The IO decomposition is an intermediate decomposition that provides compatibility between a relatively arbitrary computational decomposition and the MPI-IO, netcdf, pnetcdf, or other IO layers. Depending on the IO methods used, only certain IO decompositions are valid. In general, the IO decomposition is not used and is set internally; in some cases it can be used, and it then impacts IO performance.
The namelist input file is called "testpio_in". The first namelist block, io_nml, contains some general settings:
|casename||string, user defined test case name|
|nx_global||integer, global size of "x" dimension|
|ny_global||integer, global size of "y" dimension|
|nz_global||integer, global size of "z" dimension|
|ioFMT||string, type and I/O method of the data file: "bin" (binary), "pnc" (pnetcdf), or "snc" (serial netcdf)|
|rearr||string, type of rearranging to be done ("none","mct","box","boxauto")|
|nprocsIO||integer, number of IO processors; used only when rearr is not "none". If rearr is "none", the IO decomposition is the computational decomposition|
|base||integer, base pe associated with nprocsIO striding|
|stride||integer, the stride of io pes across the global pe set. A stride=-1 directs PIO to calculate the stride automatically.|
|num_aggregator||integer, mpi-io number of aggregators, only used if no pio rearranging is done|
|dir||string, directory to write output data; it must exist before the model starts up|
|num_iodofs||integer, tests either the 1dof or the 2dof init decomp interface (1,2)|
|maxiter||integer, the number of trials for the test|
|DebugLevel||integer, sets the debug level (0,1,2,3)|
|compdof_input||string, setting of the compDOF ('namelist' or a filename)|
|compdof_output||string, whether the compDOF is saved to disk ('none' or a filename)|
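As an illustration of how these settings fit together, here is a minimal Python sketch that renders an `io_nml` block in standard Fortran namelist syntax. The helper `render_namelist` is hypothetical and the values are illustrative only; they are not taken from the namelists shipped with testpio, and the sketch handles only integer and string settings.

```python
def render_namelist(group, settings):
    """Render one Fortran namelist group as text (integers and strings only)."""
    lines = [f"&{group}"]
    for name, value in settings.items():
        if isinstance(value, str):
            rendered = f"'{value}'"  # Fortran character literal
        else:
            rendered = str(value)
        lines.append(f"  {name} = {rendered}")
    lines.append("/")  # a slash terminates the namelist group
    return "\n".join(lines) + "\n"


# Illustrative values only (assumed for this sketch).
io_nml = {
    "casename": "example",
    "nx_global": 6,
    "ny_global": 4,
    "nz_global": 1,
    "ioFMT": "snc",
    "rearr": "box",
    "nprocsIO": 2,
    "base": 0,
    "stride": -1,
    "dir": "./",
    "maxiter": 1,
    "DebugLevel": 0,
    "compdof_input": "namelist",
    "compdof_output": "none",
}

text = render_namelist("io_nml", io_nml)
print(text)
```

The printed text could be saved as the first block of a "testpio_in" file; the compdof_nml and iodof_nml blocks described next would follow in the same syntax.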
Two other namelist blocks, compdof_nml and iodof_nml, describe the computational and IO decompositions. These namelist blocks are identical in use.
|namelist compdof_nml - or - iodof_nml|
|nblksppe||integer, sets the number of blocks desired per pe; the default is one per pe for automatic decomposition. Increasing this value increases the flexibility of decompositions.|
|grdorder||string, sets the gridcell ordering within the block ("xyz","xzy","yxz","yzx","zxy","zyx")|
|grddecomp||string, sets up the block size with gdx, gdy, and gdz, see below, ("x","y","z","xy","xye","xz","xze","yz","yze", "xyz","xyze","setblk")|
|gdx||integer, "x" size of block|
|gdy||integer, "y" size of block|
|gdz||integer, "z" size of block|
|blkorder||string, sets the block ordering within the domain ("xyz","xzy","yxz","yzx","zxy","zyx")|
|blkdecomp1||string, sets up the block / processor layout within the domain with bdx, bdy, and bdz, see below. ("x","y","z","xy","xye","xz","xze","yz","yze","xyz","xyze", "setblk","cont1d","cont1dm")|
|blkdecomp2||string, provides an additional option applied to the block decomp after blkdecomp1 is computed ("","ysym2","ysym4")|
|bdx||integer, "x" number of contiguous blocks|
|bdy||integer, "y" number of contiguous blocks|
|bdz||integer, "z" number of contiguous blocks|
A description of the decomposition implementation and some examples are provided below.
Testpio produces several outputs: summary information on stdout, data files in the namelists directory, and a netcdf file summarizing the decompositions. The key output is written to stdout and contains the timing information. In addition, a netcdf file called gdecomp.nc is written that provides both the block and task ids for each gridcell as computed by the decompositions. Finally, foo.* files are written by testpio using the methods specified.
Currently, the timing information is limited to the high-level pio read/write calls, which generally include copy and rearrange overhead as well as actual I/O time. Additional timers will be added in the future.
The test script is called testpio_run.pl. It uses the hostname function to determine the platform. New platforms can be added by editing the files build_defaults.xml and Utils.pm. If more than one configuration should be tested on a single platform, you can provide two hostnames in this file and specify the name to test with the --host option to testpio_run.pl.
There are several testpio_in files for the PIO test suite. Those distributed with PIO each test a specific feature. In general, these tests include:
PIO can use several backend libraries, including netcdf, pnetcdf, and mpi-io. For each library used, a compile-time cpp flag is defined (e.g., _USEMPIIO). The test suite builds and tests the model for several combinations of these cpp flags.
The decomposition implementation supports the decomposition of a general three-dimensional "nx * ny * nz" grid into multiple blocks of gridcells, which are then ordered and assigned to processors. In general, blocks in the decomposition are rectangular ("gdx * gdy * gdz") and of the same size, although some blocks around the edges of the domain may be smaller if the decomposition is uneven. Both gridcells within the block and blocks within the domain can be ordered in any of the possible dimension hierarchies, such as "xyz", where the first dimension varies fastest.
The gdx, gdy, and gdz inputs allow the user to specify the block size in any dimension, and the grddecomp input specifies which dimensions are to be further optimized. In general, automatic decomposition of three-dimensional grids can be done in any possible combination of dimensions (x, y, z, xy, xz, yz, or xyz), with the other dimensions having a fixed block size. The automatic generation of the decomposition is based upon an internal algorithm that tries to determine the most "square" blocks, with an additional constraint on minimizing the maximum number of gridcells across processors. If evenly divided grids are desired, appending "e" to the grddecomp value specifies that the grid decomposition must be evenly divided. The setblk option uses the prescribed gdx, gdy, and gdz inputs without further automation.
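To make the difference between an even and an uneven decomposition concrete, the following Python sketch splits one dimension into blocks of a fixed size and reports whether the split is even. It illustrates only the fixed-block-size case; it is not the internal "square block" algorithm, and the function names are hypothetical.

```python
def block_extents(n_global, block_size):
    """Split n_global cells into consecutive (start, size) blocks.

    Every block has block_size cells except possibly the last one,
    which is smaller when block_size does not divide n_global.
    """
    if n_global < 1 or block_size < 1:
        raise ValueError("n_global and block_size must be positive")
    extents = []
    start = 0
    while start < n_global:
        size = min(block_size, n_global - start)
        extents.append((start, size))
        start += size
    return extents


def divides_evenly(n_global, block_size):
    """True when every block along this dimension has the same size."""
    return n_global % block_size == 0


print(block_extents(8, 3))   # the last block is smaller
print(divides_evenly(8, 3))  # an "e" style constraint would reject this split
print(divides_evenly(8, 2))
```

Under this reading, a block size of 3 on an 8-cell dimension leaves a smaller edge block, which is the situation an evenly divided decomposition is meant to exclude.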
The blkdecomp1 input works fundamentally the same way as grddecomp in mapping blocks to processors, but has a few additional options. "cont1d" (contiguous 1d) unwraps the blocks in the order specified by the blkorder input and then decomposes that "1d" list of blocks onto processors by grouping contiguous blocks together and allocating each group to a processor. The number of contiguous blocks allocated to a processor is the maximum of the bdx, bdy, and bdz inputs. Contiguous blocks are allocated to each processor in turn in a round-robin fashion until all blocks are allocated. "cont1dm" does the same thing, except that the number of contiguous blocks is set automatically such that each processor receives only one set of contiguous blocks. The ysym2 and ysym4 blkdecomp2 options modify the original block layout such that the tasks assigned to the blocks are 2-way or 4-way symmetric in the y axis.
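The round-robin allocation described for "cont1d" and "cont1dm" can be sketched in a few lines of Python. This is a reading of the description above, not testpio's own code: processors are numbered from 0, the blocks are assumed to be already unwrapped in blkorder, and the ceiling division used for "cont1dm" is an assumption about how the group size is chosen.

```python
def cont1d(num_blocks, num_procs, bdx, bdy, bdz):
    """Owner (0-based processor) of each block in the unwrapped 1d list.

    Groups of max(bdx, bdy, bdz) contiguous blocks are handed to the
    processors in turn, wrapping around until all blocks are assigned.
    """
    chunk = max(bdx, bdy, bdz)
    if chunk < 1:
        raise ValueError("at least one of bdx, bdy, bdz must be positive")
    return [(b // chunk) % num_procs for b in range(num_blocks)]


def cont1dm(num_blocks, num_procs):
    """Like cont1d, but each processor gets exactly one contiguous group.

    Assumption: the group size is ceil(num_blocks / num_procs).
    """
    chunk = -(-num_blocks // num_procs)  # ceiling division
    return [b // chunk for b in range(num_blocks)]


print(cont1d(8, 2, 2, 1, 1))  # groups of 2 blocks alternate between 2 processors
print(cont1dm(8, 4))          # one group of 2 blocks per processor
```

With 8 blocks and 2 processors, "cont1d" with a group size of 2 gives each processor two separate groups, while "cont1dm" would give each processor a single group of 4.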
The decomposition tool is extremely flexible, but arbitrary inputs will not always yield valid decompositions. If a valid decomposition cannot be computed based on the global grid size, number of pes, number of blocks desired, and decomposition options, the model will stop.
As indicated above, the IO decomposition must be suited to the IO methods, so decompositions are even further limited by those constraints. The testpio tool provides limited checking about whether the IO decomposition is valid for the IO method used. Since the IO output is written in "xyz" order, it's likely the best IO performance will be achieved with both grdorder and blkorder set to "xyz" for the IO decomposition.
Also note that in all cases, regardless of the decomposition, the global gridcell numbering and ordering in the output file is assumed to be "xyz" and defined as a single block. The numbering scheme in the examples below demonstrates how the namelist input relates back to the grid numbering on the local computational decomposition.
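The relationship between a block's local ordering and the global "xyz" numbering can be sketched as follows. The sketch assumes 1-based global ids with x varying fastest, that is, id = 1 + x + nx*y + nx*ny*z for 0-based coordinates; the function name and argument layout are hypothetical.

```python
def block_global_ids(nx, ny, origin, size, order="xyz"):
    """Global "xyz" ids of a block's cells, listed in the block's local order.

    origin: 0-based (x, y, z) of the block's first cell in the global grid
    size:   (gdx, gdy, gdz) extent of the block
    order:  local ordering; its first letter varies fastest
    """
    if sorted(order) != ["x", "y", "z"]:
        raise ValueError("order must be a permutation of 'xyz'")
    x0, y0, z0 = origin
    extent = dict(zip("xyz", size))
    fast, mid, slow = order
    ids = []
    for s in range(extent[slow]):
        for m in range(extent[mid]):
            for f in range(extent[fast]):
                local = {fast: f, mid: m, slow: s}
                x, y, z = x0 + local["x"], y0 + local["y"], z0 + local["z"]
                ids.append(1 + x + nx * y + nx * ny * z)
    return ids


# A 3 x 2 block in the lower right of a 6 x 4 x 1 grid, local "xyz" order.
print(block_global_ids(6, 4, (3, 0, 0), (3, 2, 1)))
# The lower-left block with local "yxz" order.
print(block_global_ids(6, 4, (0, 0, 0), (3, 2, 1), "yxz"))
```

The position of an id in the returned list is the local gridcell number, so the list plays the role of the mapping from the local decomposition back to the global grid.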
Some decomposition examples:
Standard xyz ordering, 2d decomp
note: blkdecomp plays no role since there is 1 block per pe

    nx_global  6
    ny_global  4
    nz_global  1      ______________________________
    npes       4     |B3  P3        |B4  P4         |
    nblksppe   1     |              |               |
    grdorder   "xyz" |              |               |
    grddecomp  "xy"  |              |               |
    gdx        0     |              |               |
    gdy        0     |--------------+---------------|
    gdz        0     |B1  P1        |B2  P2         |
    blkorder   "xyz" |  4   5   6   |  4   5   6    |
    blkdecomp1 "xy"  |              |               |
    blkdecomp2 ""    |              |               |
    bdx        0     |  1   2   3   |  1   2   3    |
    bdy        0     |______________|_______________|
    bdz        0
Same as above but yxz ordering, 2d decomp
note: blkdecomp plays no role since there is 1 block per pe

    nx_global  6
    ny_global  4
    nz_global  1      _____________________________
    npes       4     |B2  P2        |B4  P4        |
    nblksppe   1     |              |              |
    grdorder   "yxz" |              |              |
    grddecomp  "xy"  |              |              |
    gdx        0     |              |              |
    gdy        0     |--------------+--------------|
    gdz        0     |B1  P1        |B3  P3        |
    blkorder   "yxz" |  2   4   6   |  2   4   6   |
    blkdecomp1 "xy"  |              |              |
    blkdecomp2 ""    |              |              |
    bdx        0     |  1   3   5   |  1   3   5   |
    bdy        0     |______________|______________|
    bdz        0
xyz grid ordering, 1d x decomp
note: blkdecomp plays no role since there is 1 block per pe
note: blkorder plays no role since it's a 1d decomp

    nx_global  8
    ny_global  4
    nz_global  1      _____________________________________
    npes       4     |B1  P1  |B2  P2   |B3  P3  |B4  P4   |
    nblksppe   1     | 7   8  |  7   8  |        |         |
    grdorder   "xyz" |        |         |        |         |
    grddecomp  "x"   |        |         |        |         |
    gdx        0     | 5   6  |  5   6  |        |         |
    gdy        0     |        |         |        |         |
    gdz        0     |        |         |        |         |
    blkorder   "yxz" | 3   4  |  3   4  |        |         |
    blkdecomp1 "xy"  |        |         |        |         |
    blkdecomp2 ""    |        |         |        |         |
    bdx        0     | 1   2  |  1   2  |        |         |
    bdy        0     |________|_________|________|_________|
    bdz        0
yxz block ordering, 2d grid decomp, 2d block decomp, 4 blocks per pe

    nx_global  8
    ny_global  4
    nz_global  1      _____________________________________
    npes       4     |B4  P2  |B8  P2   |B12  P4 |B16  P4  |
    nblksppe   4     |        |         |        |         |
    grdorder   "xyz" |--------+---------+--------+---------|
    grddecomp  "xy"  |B3  P2  |B7  P2   |B11  P4 |B15  P4  |
    gdx        0     |        |         |        |         |
    gdy        0     |--------+---------+--------+---------|
    gdz        0     |B2  P1  |B6  P1   |B10  P3 |B14  P3  |
    blkorder   "yxz" |        |         |        |         |
    blkdecomp1 "xy"  |--------+---------+--------+---------|
    blkdecomp2 ""    |B1  P1  |B5  P1   |B9   P3 |B13  P3  |
    bdx        0     | 1   2  |  1   2  |        |         |
    bdy        0     |________|_________|________|_________|
    bdz        0
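The block and processor numbers in this last example can be reproduced with a short Python sketch. It assumes a 4 x 4 layout of blocks shared by a 2 x 2 layout of processors, as read off the figure, with both blocks and processors numbered with y varying fastest ("yxz"). The function name and the explicit processor layout arguments are hypothetical; testpio derives its layout internally.

```python
def block_layout(nbx, nby, npx, npy):
    """Map (bx, by) block coordinates to 1-based (block_id, proc_id).

    nbx, nby: number of blocks in x and y
    npx, npy: number of processors in x and y (assumed to divide evenly)
    Both ids are numbered with y varying fastest.
    """
    if nbx % npx or nby % npy:
        raise ValueError("blocks must divide evenly among processors")
    per_x, per_y = nbx // npx, nby // npy
    layout = {}
    for bx in range(nbx):
        for by in range(nby):
            block_id = 1 + by + nby * bx
            proc_id = 1 + (by // per_y) + npy * (bx // per_x)
            layout[(bx, by)] = (block_id, proc_id)
    return layout


layout = block_layout(4, 4, 2, 2)
# Print the grid top row first, matching the orientation of the figure.
for by in reversed(range(4)):
    print("  ".join("B%-2d P%d" % layout[(bx, by)] for bx in range(4)))
```

Each processor ends up with a 2 x 2 patch of blocks, which corresponds to the four blocks per pe requested by nblksppe in this example.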