.. _mpi:

MPI
===

Distributed parallel computations with MPI in PyOP2 require the mesh to be
partitioned among the processors. To be able to compute over entities on
their boundaries, partitions need to access data owned by neighboring
processors. This region, called the *halo*, needs to be kept up to date and
is therefore exchanged between the processors as required.

Local Numbering
---------------

The partition of each :class:`~pyop2.Set` local to each process consists of
entities *owned* by the process and the *halo*, which are entities owned by
other processes but required to compute on the boundary of the owned
entities. Each of these sections is again divided into two sections required
to efficiently overlap communication and computation and avoid communication
during matrix assembly as described below. Each locally stored
:class:`~pyop2.Set` entity therefore belongs to one of four categories:

* **Core**: Entities owned by this processor which can be processed without
  accessing halo data.
* **Owned**: Entities owned by this processor which access halo data when
  processed.
* **Exec halo**: Off-processor entities which are redundantly executed over
  because they touch owned entities.
* **Non-exec halo**: Off-processor entities which are not processed, but read
  when computing the exec halo.

The following diagram illustrates the four sections for a mesh distributed
among two processors:

.. figure:: images/pyop2_mpi_mesh.svg
   :align: center

   A mesh distributed among two processors with the entities of each mesh
   partition divided into *core*, *owned*, *exec halo* and *non-exec halo*.
   Matching halo sections are highlighted in matching colours. The owned
   section of process 0 corresponds to the non-exec section of process 1.

For data defined on the :class:`~pyop2.Set` to be stored contiguously per
section, local :class:`~pyop2.Set` entities must be numbered such that core
entities come first, followed by owned, exec halo and non-exec halo in that
order. A good partitioning maximises the size of the core section and
minimises the halo regions. We can therefore assume that the vast majority of
local :class:`~pyop2.Set` entities are in the core section.
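As a rough illustration of this numbering (the section sizes and variable
names below are invented for this sketch and are not part of the PyOP2 API),
each section is simply a contiguous range of local indices: ::

  # Hypothetical sizes of the four sections of a local Set partition.
  core_size, owned_size, exec_size, nonexec_size = 1000, 40, 30, 20

  # Because entities are numbered core, owned, exec halo, non-exec halo,
  # each section is a contiguous range of local indices.
  core      = range(0, core_size)
  owned     = range(core_size, core_size + owned_size)
  exec_halo = range(core_size + owned_size,
                    core_size + owned_size + exec_size)
  non_exec  = range(core_size + owned_size + exec_size,
                    core_size + owned_size + exec_size + nonexec_size)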
Computation-communication Overlap
---------------------------------

The ordering of :class:`~pyop2.Set` entities into four sections allows for a
very efficient overlap of computation and communication. Core entities can be
processed immediately after the halo exchange has been initiated, since they
do not access any halo data. Execution over the owned and exec halo regions
requires up-to-date halo data and can only start once the halo exchange has
completed. Depending on the latency and bandwidth of communication and the
size of the core section relative to the halo, the halo exchange may complete
before the computation on the core section. The entire process is given
below: ::

  halo_exchange_begin()                   # Initiate halo exchange
  maybe_set_dat_dirty()                   # Mark Dats as modified
  compute_if_not_empty(itset.core_part)   # Compute core region
  halo_exchange_end()                     # Wait for halo exchange
  compute_if_not_empty(itset.owned_part)  # Compute owned region
  reduction_begin()                       # Initiate reductions
  if needs_exec_halo:                     # Any indirect Dat not READ?
      compute_if_not_empty(itset.exec_part)  # Compute exec halo region
  reduction_end()                         # Wait for reductions
  maybe_set_halo_update_needed()          # Mark halos as out of date
  assemble()                              # Finalise matrix assembly

Any reductions depend only on data from the core and owned sections and are
therefore initiated as soon as the owned section has been processed; they
execute concurrently with computation on the exec halo. Similar to
`halo_exchange_begin` and `halo_exchange_end`, `reduction_begin` and
`reduction_end` do no work at all if none of the :func:`~pyop2.par_loop`
arguments requires a reduction. If the :func:`~pyop2.par_loop` assembles a
:class:`~pyop2.Mat`, the matrix assembly is finalised at the end.

By dividing entities into sections according to their relation to the halo,
there is no need to check whether a given entity touches the halo during
computation on each section. This avoids branching in kernels or wrapper code
and allows launching separate kernels for GPU execution of each section. The
:func:`~pyop2.par_loop` execution therefore has the above structure for all
backends.

Halo exchange
-------------

Exchanging halo data is only required if the halo data is actually read,
which is the case for :class:`~pyop2.Dat` arguments to a
:func:`~pyop2.par_loop` used in :data:`pyop2.READ` or :data:`pyop2.RW` mode.
PyOP2 keeps track of whether the halo region may have been modified. This is
the case for :class:`Dats <pyop2.Dat>` used in :data:`pyop2.INC`,
:data:`pyop2.WRITE` or :data:`pyop2.RW` mode, or when a
:class:`~pyop2.Solver` or a user requests access to the data. A halo exchange
is triggered only for halos marked as out of date.

Distributed Assembly
--------------------

For an MPI distributed matrix or vector, assembling owned entities at the
boundary can contribute to off-process degrees of freedom and vice versa.

There are different ways of accounting for these off-process contributions.
PETSc_ supports insertion and subsequent communication of off-process matrix
and vector entries, however its implementation is not thread safe. Concurrent
insertion into PETSc_ MPI matrices *is* thread safe if off-process insertions
are not cached and concurrent writes to rows are avoided, which is done
through colouring as described in :ref:`plan-colouring`.

PyOP2 therefore disables PETSc_'s off-process insertion feature and instead
redundantly computes over all off-process entities that touch local dofs,
which is the *exec halo* section described above. The price for this is
maintaining a larger halo, since we also need halo data, the *non-exec halo*
section, to perform the redundant computation. Halos grow by about a factor
of two, however in practice this is still small compared to the interior
region of a partition, and the main cost of halo exchange is the latency,
which is independent of the exchanged data volume.

.. _PETSc: http://www.mcs.anl.gov/petsc/
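As a minimal sketch of what disabling off-process insertion looks like at the
PETSc_ level via petsc4py (this is not PyOP2's actual assembly code; the
matrix size and preallocation are placeholder values): ::

  from petsc4py import PETSc

  # Create a small parallel AIJ matrix (placeholder size and preallocation).
  mat = PETSc.Mat().createAIJ(size=(100, 100), nnz=10, comm=PETSc.COMM_WORLD)

  # Drop any entries destined for other processes: each rank computes the
  # contributions of its exec halo redundantly, so nothing needs to be
  # communicated at assembly time.
  mat.setOption(PETSc.Mat.Option.IGNORE_OFF_PROC_ENTRIES, True)

  # ... insert locally computed contributions with mat.setValues(...) ...

  mat.assemblyBegin()   # Finalise assembly; no off-process
  mat.assemblyEnd()     # communication is required.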