PIO  1.7.1
 All Classes Files Functions Groups Pages
Describing decompositions

On of the biggest challenges to working with PIO is setting up the call to PIO_initdecomp. The user must properly describe how the data within each MPI tasks memory should be placed or retrieved from disk. PIO provides several interfaces. We describe the simplest interface first and then progress to the most complex and flexible interface.

Block-cyclic interface

The simplest interface assumes that your arrays are decomposed in a block-cyclic structure and can be described simply using a start-count type approach. A simple block-cyclic decomposition for a 1-dimension (1D) array is illustrated in Figure 1. Note that contigious layout of the data in memory can be easily mapped to a contigious layout on disk. The start arguments correspond to the starting point of the block of contigious memory, while the count is the number of words. Note that start and count must be arrays of length equal to the dimensionality of the distributed array. While we use a 1D array for simplicity, PIO currently supports up to the Fortran limit of 7-dimension (7D) arrays. In the case of 7D arrays, the start and count arrays would be of length 7.

block-cyclic.png
Figure 1: Setting up the start and count arrays for a single 1D array distributed accross 3 MPI tasks.

The call to PIO_initdecomp that would implement the decomposition illustrated in Figure 1 is listed below. The variable iosystem is created by the call to PIO_init. The second argument PIO_double is the PIO kind, and indicates that this is a decomposition for a 8-byte real. (For a list of supported kinds see PIO_kinds.) The argument dims is the global dimension for the array. The start and count arrays are 8-byte integers of type PIO_OFFSET, while iodesc is the IO descriptor generated by the call to PIO_initdecomp.

 type (iosystem_desc_t)     :: iosystem
 integer (i4)               :: dims(1) 
 integer (kind=PIO_OFFSET)  :: start(1), count(1)
 type (io_desc_t)           :: iodesc
 ...
 !---------------------------------------
 ! Initializing the decomposition on PE 0
 !---------------------------------------
 dims(1) = 8
 start(1) = 3
 count(1) = 3
 call PIO_initdecomp(iosystem,PIO_double,dims,start,count,iodesc)

Controlling IO decomposition

The above example represents the simplest way to initialize and use PIO to write out and read distributed arrays. However, PIO provides some additional features that allows greater control over the IO process. In particular, it provides the ability to define an IO decomposition. Note that a user defined IO decomposition is optional. If one is not provided and rearrangement is necessary, PIO will internally compute an IO decomposition. The reason an IO decomposition may be necessary is described in the section Degree of freedom interface below.

This flexibility provides the ability to define an intermediate decomposition that is unique from the computational decomposition. This IO decomposition can be constructed to maximize the write or read performance to the disk subsystem. We extend the simple example in Figure 1 to include an IO decomposition in Figure 1b.

block-cyclic-rearr.png
Figure 1b: Block cyclic decomposition with rearrangement

Figure 1b illustrates the creation of an IO decomposition on two of the MPI tasks. For this decomposition, the 8 word IO decomposition array and corresponding disk layout are evenly distributed between PE 0 (yellow) and PE 2 (blue). The arrows in Figure 1b indicates rearrangement that is performed within the PIO library. In this case, PE 0 sends a word to PE 2, illustrated by the shading of yellow to blue, while PE 1 sends two words to PE 0 as illustrated by the shading of red to yellow. The rearranged array in the IO decomposition is subsequently written to disk. Note that in this case, only two of three MPI tasks are performing writes to disk. The number of MPI tasks involved in IO to disk is specified in the PIO_init using a combination of the num_aggregator and stride parameters. For figure 1b, the num_aggregator=3 and the stride=2. PIO allows the user to specify the IO decomposition using the optional parameters iostart and iocount. The following bits of code for PE 0, PE 1, and PE 2 illustrates the necessary calls to PIO_initdecomp.

 type (iosystem_desc_t)     :: iosystem
 integer (i4)               :: dims(1) 
 integer (kind=PIO_OFFSET)  :: compstart(1), compcount(1)
 integer (kind=PIO_OFFSET)  :: iostart(1), iocount(1)
 type (io_desc_t)           :: iodesc
 ...
 !---------------------------------------
 ! Initializing the decomposition on PE 0
 !---------------------------------------
 dims(1) = 8
 compstart(1) = 3
 compcount(1) = 3
 iostart(1) = 1
 iocount(1) = 4
 call PIO_initdecomp(iosystem,PIO_double,dims,compstart,compcount,iodesc,iostart=iostart,iocount=iocount)

 type (iosystem_desc_t)     :: iosystem
 integer (i4)               :: dims(1) 
 integer (kind=PIO_OFFSET)  :: compstart(1), compcount(1)
 type (io_desc_t)           :: iodesc
 ...
 !---------------------------------------
 ! Initializing the decomposition on PE 1
 !---------------------------------------
 dims(1) = 8
 compstart(1) = 1
 compcount(1) = 2
 call PIO_initdecomp(iosystem,PIO_double,dims,compstart,compcount,iodesc)

 type (iosystem_desc_t)     :: iosystem
 integer (i4)               :: dims(1) 
 integer (kind=PIO_OFFSET)  :: compstart(1), compcount(1)
 integer (kind=PIO_OFFSET)  :: iostart(1), iocount(1)
 type (io_desc_t)           :: iodesc
 ...
 !---------------------------------------
 ! Initializing the decomposition on PE 2
 !---------------------------------------
 dims(1) = 8
 compstart(1) = 6
 compcount(1) = 3
 iostart(1) = 5
 iocount(1) = 4
 call PIO_initdecomp(iosystem,PIO_double,dims,compstart,compcount,iodesc,iostart=iostart,iocount=iocount)

Degree of freedom interface

The interface described in Section Block-cyclic interface, while simple and used by both pNetCDF (http://trac.mcs.anl.gov/projects/parallel-netcdf) and NetCDF-4 (http://www.unidata.ucar.edu/software/netcdf/) can be insufficient for applications with non-trivial decompositions. While it is possible to use multiple calls to construct a file with a non-trivial decomposition, the performance penalty may be significant. Therefore, PIO provides a more general interface to PIO_initdecomp based on the degree of freedom concept. Each word within the distributed array must be given a unique value that corresponds to its order placement in the file on disk. So, the first word in the file on disk has a dof of 1, the second 2, etc. This allows a fully general specification of the decomposition. We illustrate its use in Figure 2. Note that in Figure 2, PE 0 and PE 1 do not contain contiguous pieces of the distributed array. The desired order on disk must be specified using the the compDOF argument to PIO_initdecomp. In this case PE 0 contains the 2nd, 4th, and 5th element of the array, PE 1 contains the 1st and 3rd, and PE 2 contains the 6th, 7th, and 8th elements of the array. The integer compDOF arrays for each MPI task is illustrated at the bottom of Figure 2.

dof.png
Figure 2: Setting up the comDOF arrays for a single 1D array distributed accross 3 MPI tasks.

The call to PIO_initdecomp which implements Figure 2 on PE 0 is provided below.

 type (iosystem_desc_t)     :: iosystem
 integer (i4)               :: dims(1) 
 integer (i4)               :: compdof
 type (io_desc_t)           :: iodesc
 ...
 !---------------------------------------
 ! Initializing the decomposition on PE 0
 !---------------------------------------
 dims(1) = 8
 compdof = (/2,4,5/)
 call PIO_initdecomp(iosystem,PIO_double,dims,compdof,iodesc)

As with the block-cyclic interface, the degree of freedom interface provides the ability to specify the io decomposition through optional arguments to PIO_initdecomp.

dof-rearr.png
Figure 3: Setting up the comDOF arrays and setting IO decomposition for a single 1D array distributed accross 3 MPI tasks and written from 2 tasks after rearrangement within the PIO library

Figure 2 illustrates the inclusion of an IO decomposition and associated rearrangement to write out the distributed array. The shading of the array elements shows how the individual PE arrays are blended using the IO decomposition specifications. The subroutine call to PIO_initdecomp for PE 0 is illustrated below:

 type (iosystem_desc_t)     :: iosystem
 integer (i4)               :: dims(1) 
 integer (i4)               :: compdof
 type (io_desc_t)           :: iodesc
 integer (kind=PIO_OFFSET)  :: iostart(:),iocount(:)

 ...
 !---------------------------------------
 ! Initializing the decomposition on PE 0
 !---------------------------------------
 dims(1) = 8
 compdof = (/2,4,5/)
 iostart(1) = 1
 iocount(1) = 4
 call PIO_initdecomp(iosystem,PIO_double,dims,compdof,iodesc,iostart=iostart,iocount=iocount)