MPI

Distributed parallel computations with MPI in PyOP2 require the mesh to be partitioned among the processors. To compute over entities on the partition boundaries, each process needs access to data owned by neighboring processes. This data, stored in a region called the halo, needs to be kept up to date and is therefore exchanged between the processors as required.

Local Numbering

The partition of each Set local to each process consists of entities owned by the process and the halo, which are entities owned by other processes but required to compute on the boundary of the owned entities. Each of these sections is again subdivided into two, to allow an efficient overlap of communication and computation and to avoid communication during matrix assembly, as described below. Each locally stored Set entity therefore belongs to one of four categories:

  • Core: Entities owned by this processor which can be processed without accessing halo data.

  • Owned: Entities owned by this processor which access halo data when processed.

  • Exec halo: Off-processor entities which are redundantly executed over because they touch owned entities.

  • Non-exec halo: Off-processor entities which are not processed, but read when computing the exec halo.

The following diagram illustrates the four sections for a mesh distributed among two processors:

_images/pyop2_mpi_mesh.svg

A mesh distributed among two processors with the entities of each mesh partition divided into core, owned, exec halo and non-exec halo. Matching halo sections are highlighted in matching colours. The owned section of process 0 corresponds to the non-exec section of process 1.

For data defined on the Set to be stored contiguously per section, local Set entities must be numbered such that core entities are first, followed by owned, exec halo and non-exec halo in that order. A good partitioning maximises the size of the core section and minimises the halo regions. We can therefore assume that the vast majority of local Set entities are in the core section.
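Since a good partitioning makes the core section by far the largest, the per-section layout can be described by a handful of offsets. The following minimal numpy sketch (with made-up section sizes and variable names, not PyOP2's actual data structures) shows how data defined on such a Set occupies one contiguous slice per section:

import numpy as np

# Hypothetical section sizes for one process; the real numbers depend on
# the mesh and the partitioner.
core_size, owned_size, exec_halo_size, nonexec_halo_size = 10000, 200, 180, 160

# With entities numbered core < owned < exec halo < non-exec halo, each
# section is a contiguous slice of any array of data defined on the Set.
offsets = np.cumsum([0, core_size, owned_size, exec_halo_size, nonexec_halo_size])
data = np.zeros(offsets[-1])                # one value per local Set entity

core      = data[offsets[0]:offsets[1]]     # computed before the halo exchange completes
owned     = data[offsets[1]:offsets[2]]     # computed after the halo exchange completes
exec_halo = data[offsets[2]:offsets[3]]     # off-process entities executed redundantly
nonexec   = data[offsets[3]:offsets[4]]     # only ever read, never executed over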

Computation-communication Overlap

The ordering of Set entities into four sections allows for a very efficient overlap of computation and communication. Since core entities do not access any halo data, they can be processed immediately after the halo exchange has been initiated. Execution over the owned and exec halo regions requires up-to-date halo data and can only start once the halo exchange is completed. Depending on the latency and bandwidth of communication and the size of the core section relative to the halo, the halo exchange may already have completed by the time computation on the core section finishes.

The entire process is given below:

halo_exchange_begin()                      # Initiate halo exchange
maybe_set_dat_dirty()                      # Mark Dats as modified
compute_if_not_empty(itset.core_part)      # Compute core region
halo_exchange_end()                        # Wait for halo exchange
compute_if_not_empty(itset.owned_part)     # Compute owned region
reduction_begin()                          # Initiate reductions
if needs_exec_halo:                        # Any indirect Dat not READ?
    compute_if_not_empty(itset.exec_part)  # Compute exec halo region
reduction_end()                            # Wait for reductions
maybe_set_halo_update_needed()             # Mark halos as out of date
assemble()                                 # Finalise matrix assembly

Reductions depend only on data from the core and owned sections; they are therefore initiated as soon as the owned section has been processed and execute concurrently with computation on the exec halo. Similar to halo_exchange_begin and halo_exchange_end, reduction_begin and reduction_end do no work at all if none of the par_loop() arguments requires a reduction. If the par_loop() assembles a Mat, the matrix assembly is finalised at the end.

By dividing entities into sections according to their relation to the halo, there is no need to check during the computation on each section whether or not a given entity touches the halo. This avoids branching in kernels or wrapper code and allows launching a separate kernel for the GPU execution of each section. The par_loop() execution therefore has the above structure for all backends.
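As an illustration of how the begin/end pairs above map onto non-blocking MPI calls, the following self-contained sketch overlaps a halo exchange with computation on the core section for a periodic 1D arrangement of values. It uses mpi4py directly and made-up names; it is not PyOP2's actual halo implementation:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # periodic neighbours

n = 8                                      # entities owned by this rank
data = np.full(n + 2, float(rank))         # layout: [ core (n-2) | owned (2) | halo (2) ]
core, owned, halo = data[:n-2], data[n-2:n], data[n:]

def halo_exchange_begin():
    # Initiate non-blocking sends of the owned boundary values and
    # non-blocking receives into the halo slots.
    return [comm.Isend(data[n-2:n-1], dest=left),
            comm.Isend(data[n-1:n],   dest=right),
            comm.Irecv(data[n:n+1],   source=left),
            comm.Irecv(data[n+1:n+2], source=right)]

def halo_exchange_end(requests):
    MPI.Request.Waitall(requests)          # wait until the halo is up to date

requests = halo_exchange_begin()           # initiate halo exchange
core += 1.0                                # compute core: needs no halo data
halo_exchange_end(requests)                # wait for halo exchange
owned += halo                              # compute owned: may read halo data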

Halo exchange

Exchanging halo data is only required if the halo data is actually read, which is the case for Dat arguments to a par_loop() used in pyop2.READ or pyop2.RW mode. PyOP2 keeps track of whether or not the halo region may have been modified. This is the case for Dats used in pyop2.INC, pyop2.WRITE or pyop2.RW mode, or when a Solver or a user requests access to the data. A halo exchange is triggered only for halos marked as out of date.
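This bookkeeping amounts to a per-Dat flag, queried and set according to the access descriptors of the par_loop() arguments. The following schematic sketch shows the logic; only the access mode names correspond to PyOP2, everything else is illustrative:

READ, WRITE, RW, INC = "READ", "WRITE", "RW", "INC"   # PyOP2 access modes

class Dat:
    def __init__(self):
        self.halo_out_of_date = False

    def exchange_halo_if_needed(self):
        if self.halo_out_of_date:
            # ... perform the actual MPI halo exchange here ...
            self.halo_out_of_date = False

def prepare_arg(dat, access):
    # Halo data only has to be up to date if it is actually read.
    if access in (READ, RW):
        dat.exchange_halo_if_needed()
    # Any write-like access may invalidate copies held in remote halos.
    if access in (INC, WRITE, RW):
        dat.halo_out_of_date = True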

Distributed Assembly

For an MPI distributed matrix or vector, assembling over owned entities at the partition boundary can contribute to off-process degrees of freedom, and vice versa.

There are different ways of accounting for these off-process contributions. PETSc supports insertion and subsequent communication of off-process matrix and vector entries; however, this implementation is not thread safe. Concurrent insertion into PETSc MPI matrices is thread safe if off-process insertions are not cached and concurrent writes to rows are avoided, which is done through colouring as described in Colouring.

PyOP2 therefore disables PETSc’s off-process insertion feature and instead redundantly computes over all off-process entities that touch local dofs, which is the exec halo section described above. The price for this is a larger halo, since the data needed for the redundant computation, the non-exec halo section, must also be kept in the halo. Halos grow by roughly a factor of two; in practice, however, they remain small compared to the interior of a partition, and the main cost of a halo exchange is the latency, which is independent of the volume of data exchanged.
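A schematic petsc4py sketch of this insertion strategy is given below: off-process insertion is switched off and each process only ever adds values into rows it owns, since the contributions from boundary entities are obtained by the redundant computation over the exec halo. The tridiagonal loop is merely a stand-in for a real assembly kernel and the code is not taken from PyOP2:

from petsc4py import PETSc

n = 100                                            # global number of dofs
A = PETSc.Mat().createAIJ([n, n], comm=PETSc.COMM_WORLD)
A.setPreallocationNNZ((3, 1))                      # rough preallocation per row
# Drop, rather than communicate, any off-process entries; with redundant
# computation over the exec halo there should not be any in the first place.
A.setOption(PETSc.Mat.Option.IGNORE_OFF_PROC_ENTRIES, True)

rstart, rend = A.getOwnershipRange()
add = PETSc.InsertMode.ADD_VALUES
for i in range(rstart, rend):                      # insert into owned rows only
    if i > 0:
        A.setValue(i, i - 1, -1.0, addv=add)
    A.setValue(i, i, 2.0, addv=add)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0, addv=add)

A.assemblyBegin()                                  # nothing to communicate here,
A.assemblyEnd()                                    # all entries are already local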