Parallel Execution Plan

For all PyOP2 backends with the exception of sequential, a parallel execution plan is computed for each par_loop(). It contains information guiding the code generator on how to partition, stage and colour the data for efficient parallel processing.

Partitioning

The iteration set is split into a number of equally sized and contiguous mini-partitions such that the working set of each mini-partition fits into shared memory or last level cache. This is unrelated to the partitioning required for MPI as described in MPI.

Local Renumbering and Staging

While a mini-partition is a contiguous chunk of the iteration set, the indirectly accessed data it references is not necessarily contiguous. For each mini-partition and unique Dat-Map pair, a mapping from local indices within the partition to global indices is constructed as the sorted array of unique Map indices accessed by this partition. At the same time, a global-to-local mapping is constructed as its inverse.

Data for indirectly accessed Dat arguments is staged in shared device memory as described in PyOP2 Backends. For each partition, the local-to-global mapping indicates where data to be staged in is read from and the global-to-local mapping gives the location in shared memory data has been staged at. The amount of shared memory required is computed from the size of the local-to-global mapping.

Colouring

A two-level colouring is used to avoid race conditions. Partitions are coloured such that partitions of the same colour can be executed concurrently and threads executing on a partition in parallel are coloured such that no two threads indirectly reference the same data. Only par_loop() arguments performing an indirect reduction or assembling a matrix require colouring. Matrices are coloured per row.

For each element of a Set indirectly accessed in a par_loop(), a bit vector is used to record which colours indirectly reference it. To colour each thread within a partition, the algorithm proceeds as follows:

  1. Loop over all indirectly accessed arguments and collect the colours of all Set elements referenced by the current thread in a bit mask.

  2. Choose the next available colour as the colour of the current thread.

  3. Loop over all Set elements indirectly accessed by the current thread again and set the new colour in their colour mask.

Since the bit mask is a 32-bit integer, up to 32 colours can be processed in a single pass, which is sufficient for most applications. If not all threads can be coloured with 32 distinct colours, the mask is reset and another pass is made, where each newly allocated colour is offset by 32. Should another pass be required, the offset is increased to 64 and so on until all threads are coloured.

_images/pyop2_colouring.svg

Thread colouring within a mini-partition for a Dat on vertices indirectly accessed in a computation over the edges. The edges are coloured such that no two edges touch the same vertex within the partition.

The colouring of mini-partitions is done in the same way, except that all Set elements indirectly accessed by the entire partition are referenced, not only those accessed by a single thread.