.. _plan:

Parallel Execution Plan
=======================

For all PyOP2 backends with the exception of sequential, a parallel execution
plan is computed for each :func:`~pyop2.par_loop`. It contains information
guiding the code generator on how to partition, stage and colour the data for
efficient parallel processing.

.. _plan-partitioning:

Partitioning
------------

The iteration set is split into a number of equally sized and contiguous
mini-partitions such that the working set of each mini-partition fits into
shared memory or last level cache. This is unrelated to the partitioning
required for MPI as described in :ref:`mpi`.

.. _plan-renumbering:

Local Renumbering and Staging
-----------------------------

While a mini-partition is a contiguous chunk of the iteration set, the
indirectly accessed data it references is not necessarily contiguous. For each
mini-partition and unique :class:`~pyop2.Dat`-:class:`~pyop2.Map` pair, a
mapping from local indices within the partition to global indices is
constructed as the sorted array of unique :class:`~pyop2.Map` indices accessed
by this partition. At the same time, a global-to-local mapping is constructed
as its inverse.

Data for indirectly accessed :class:`~pyop2.Dat` arguments is staged in shared
device memory as described in :ref:`backends`. For each partition, the
local-to-global mapping indicates where data to be staged in is read from and
the global-to-local mapping gives the location in shared memory data has been
staged at. The amount of shared memory required is computed from the size of
the local-to-global mapping.

.. _plan-colouring:

Colouring
---------

A two-level colouring is used to avoid race conditions. Partitions are
coloured such that partitions of the same colour can be executed concurrently
and threads executing on a partition in parallel are coloured such that no two
threads indirectly reference the same data. Only :func:`~pyop2.par_loop`
arguments performing an indirect reduction or assembling a matrix require
colouring. Matrices are coloured per row.

For each element of a :class:`~pyop2.Set` indirectly accessed in a
:func:`~pyop2.par_loop`, a bit vector is used to record which colours
indirectly reference it. To colour each thread within a partition, the
algorithm proceeds as follows:

1. Loop over all indirectly accessed arguments and collect the colours of all
   :class:`~pyop2.Set` elements referenced by the current thread in a bit mask.
2. Choose the next available colour as the colour of the current thread.
3. Loop over all :class:`~pyop2.Set` elements indirectly accessed by the
   current thread again and set the new colour in their colour mask.

Since the bit mask is a 32-bit integer, up to 32 colours can be processed in a
single pass, which is sufficient for most applications. If not all threads can
be coloured with 32 distinct colours, the mask is reset and another pass is
made, where each newly allocated colour is offset by 32. Should another pass
be required, the offset is increased to 64 and so on until all threads are
coloured.

.. figure:: images/pyop2_colouring.svg
  :align: center

  Thread colouring within a mini-partition for a :class:`~pyop2.Dat` on
  vertices indirectly accessed in a computation over the edges. The edges are
  coloured such that no two edges touch the same vertex within the partition.

The colouring of mini-partitions is done in the same way, except that all
:class:`~pyop2.Set` elements indirectly accessed by the entire partition are
referenced, not only those accessed by a single thread.