.. _backends:

PyOP2 Backends
==============

PyOP2 provides a number of different backends to be able to run parallel
computations on different hardware architectures. The currently supported
backends are

* ``sequential``: runs sequentially on a single CPU core.
* ``openmp``: runs multiple threads on an SMP CPU using OpenMP. The number of
  threads is set with the environment variable ``OMP_NUM_THREADS``.
* ``cuda``: offloads computation to an NVIDIA GPU (requires :ref:`CUDA and
  pycuda <cuda-installation>`)
* ``opencl``: offloads computation to an OpenCL device, either a multi-core
  CPU or a GPU (requires :ref:`OpenCL and pyopencl <opencl-installation>`)

Distributed parallel computations using MPI are supported by PyOP2 and
described in detail in :doc:`mpi`. Data structures must be partitioned among
MPI processes with overlapping regions, so-called halos. The host backends
``sequential`` and ``openmp`` have full MPI support; the device backends
``cuda`` and ``opencl`` only support parallel loops on
:class:`Dats <pyop2.Dat>`. Hybrid parallel computations with OpenMP are
possible, where ``OMP_NUM_THREADS`` threads are launched per MPI rank.
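The backend is chosen when PyOP2 is initialised. A minimal sketch, assuming
``op2.init`` accepts a ``backend`` keyword as in the PyOP2 versions that
shipped these backends:

.. code-block:: python

  from pyop2 import op2

  # One of "sequential", "openmp", "cuda" or "opencl"; with "openmp" the
  # environment variable OMP_NUM_THREADS sets the threads per (MPI) process.
  op2.init(backend="openmp")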
.. _host_backends:

Host backends
-------------

Any computation in PyOP2 requires the generation of code at runtime specific
to each individual :func:`~pyop2.par_loop`. The host backends generate code
which is just-in-time (JIT) compiled into a shared library callable via
:mod:`ctypes`. The compilation procedure also takes care of caching the
compiled library on disk, such that the compilation cost is not paid every
time.

.. _sequential_backend:

Sequential backend
~~~~~~~~~~~~~~~~~~

Since there is no parallel computation for the sequential backend, the
generated code is a C wrapper function with a ``for`` loop calling the kernel
for the respective :func:`~pyop2.par_loop`. This wrapper also takes care of
staging in and out the data as specified by the access descriptors of the
parallel loop. Both the kernel and the wrapper function are just-in-time
compiled in a single compilation unit such that the kernel call can be
inlined and does not incur any function call overhead.

Recall the :func:`~pyop2.par_loop` calling the ``midpoint`` kernel from
:doc:`kernels`: ::

  op2.par_loop(midpoint, cells,
               midpoints(op2.WRITE),
               coordinates(op2.READ, cell2vertex))

.. highlight:: c
   :linenothreshold: 5

The JIT compiled code for this loop is the kernel followed by the generated
wrapper code: ::

  inline void midpoint(double p[2], double *coords[2])
  {
    p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0;
    p[1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0;
  }

  void wrap_midpoint__(PyObject *_start, PyObject *_end,
                       PyObject *_arg0_0,
                       PyObject *_arg1_0, PyObject *_arg1_0_map0_0) {
    int start = (int)PyInt_AsLong(_start);
    int end = (int)PyInt_AsLong(_end);
    double *arg0_0 = (double *)(((PyArrayObject *)_arg0_0)->data);
    double *arg1_0 = (double *)(((PyArrayObject *)_arg1_0)->data);
    int *arg1_0_map0_0 = (int *)(((PyArrayObject *)_arg1_0_map0_0)->data);
    double *arg1_0_vec[3];
    for ( int n = start; n < end; n++ ) {
      int i = n;
      arg1_0_vec[0] = arg1_0 + arg1_0_map0_0[i * 3 + 0] * 2;
      arg1_0_vec[1] = arg1_0 + arg1_0_map0_0[i * 3 + 1] * 2;
      arg1_0_vec[2] = arg1_0 + arg1_0_map0_0[i * 3 + 2] * 2;
      midpoint(arg0_0 + i * 2, arg1_0_vec);
    }
  }

Note that the wrapper function is called directly from Python and therefore
all arguments are plain Python objects, which first need to be unwrapped. The
arguments ``_start`` and ``_end`` define the iteration set indices to iterate
over. The remaining arguments are :class:`arrays <numpy.ndarray>`
corresponding to a :class:`~pyop2.Dat` or :class:`~pyop2.Map` passed to the
:func:`~pyop2.par_loop`. Arguments are consecutively numbered to avoid name
clashes.

The first :func:`~pyop2.par_loop` argument ``midpoints`` is direct and
therefore no corresponding :class:`~pyop2.Map` is passed to the wrapper
function and the data pointer is passed straight to the kernel with an
appropriate offset. The second argument ``coordinates`` is indirect and hence
a :class:`~pyop2.Dat`-:class:`~pyop2.Map` pair is passed. Pointers to the
data are gathered via the :class:`~pyop2.Map` of arity 3 and staged in the
array ``arg1_0_vec``, which is passed to the kernel. The coordinate data can
therefore be accessed in the kernel via double indirection with the
:class:`~pyop2.Map` already applied. Note that for both arguments, the
pointers are to two consecutive double values, since the
:class:`~pyop2.DataSet` is of dimension two in either case.
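To see the staging logic in isolation, the following pure-Python sketch
(illustrative only, not code PyOP2 generates) performs the same gather and
kernel call on NumPy arrays:

.. code-block:: python

  import numpy as np

  def wrap_midpoint(start, end, midpoints, coordinates, cell2vertex):
      # midpoints: (ncells, 2), coordinates: (nvertices, 2) NumPy arrays,
      # cell2vertex: (ncells, 3) integer map, as in the example above
      for n in range(start, end):
          # Stage in: gather the rows of the indirect Dat via the map
          coords = [coordinates[cell2vertex[n, v]] for v in range(3)]
          # The (inlined) user kernel body
          midpoints[n, 0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0
          midpoints[n, 1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0

  # One triangle with vertices (0,0), (1,0), (0,1): midpoint is (1/3, 1/3)
  coordinates = np.array([[0., 0.], [1., 0.], [0., 1.]])
  cell2vertex = np.array([[0, 1, 2]])
  midpoints = np.empty((1, 2))
  wrap_midpoint(0, 1, midpoints, coordinates, cell2vertex)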
.. _openmp_backend:

OpenMP backend
~~~~~~~~~~~~~~

In contrast to the sequential backend, the outermost ``for`` loop in the
OpenMP backend is annotated with OpenMP pragmas to execute in parallel with
multiple threads. To avoid race conditions on data access, the iteration set
is coloured and a thread safe execution plan is computed as described in
:ref:`plan-colouring`.

The JIT compiled code for the parallel loop from above changes as follows: ::

  void wrap_midpoint__(PyObject* _boffset,
                       PyObject* _nblocks,
                       PyObject* _blkmap,
                       PyObject* _offset,
                       PyObject* _nelems,
                       PyObject *_arg0_0,
                       PyObject *_arg1_0, PyObject *_arg1_0_map0_0) {
    int boffset = (int)PyInt_AsLong(_boffset);
    int nblocks = (int)PyInt_AsLong(_nblocks);
    int* blkmap = (int *)(((PyArrayObject *)_blkmap)->data);
    int* offset = (int *)(((PyArrayObject *)_offset)->data);
    int* nelems = (int *)(((PyArrayObject *)_nelems)->data);
    double *arg0_0 = (double *)(((PyArrayObject *)_arg0_0)->data);
    double *arg1_0 = (double *)(((PyArrayObject *)_arg1_0)->data);
    int *arg1_0_map0_0 = (int *)(((PyArrayObject *)_arg1_0_map0_0)->data);
    double *arg1_0_vec[32][3];
    #ifdef _OPENMP
    int nthread = omp_get_max_threads();
    #else
    int nthread = 1;
    #endif
    #pragma omp parallel shared(boffset, nblocks, nelems, blkmap)
    {
      int tid = omp_get_thread_num();
      #pragma omp for schedule(static)
      for (int __b = boffset; __b < boffset + nblocks; __b++) {
        int bid = blkmap[__b];
        int nelem = nelems[bid];
        int efirst = offset[bid];
        for (int n = efirst; n < efirst + nelem; n++ ) {
          int i = n;
          arg1_0_vec[tid][0] = arg1_0 + arg1_0_map0_0[i * 3 + 0] * 2;
          arg1_0_vec[tid][1] = arg1_0 + arg1_0_map0_0[i * 3 + 1] * 2;
          arg1_0_vec[tid][2] = arg1_0 + arg1_0_map0_0[i * 3 + 2] * 2;
          midpoint(arg0_0 + i * 2, arg1_0_vec[tid]);
        }
      }
    }
  }

Computation is split into ``nblocks`` blocks which start at an initial offset
``boffset`` and correspond to colours that can be executed conflict free in
parallel. This loop over colours is therefore wrapped in an OpenMP parallel
region and is annotated with an ``omp for`` pragma. The block id ``bid`` for
each of these blocks is given by the block map ``blkmap`` and is the index
into the arrays ``nelems`` and ``offset`` provided as part of the execution
plan. These are the number of elements that are part of the given block and
its starting index. Note that each thread needs its own staging array
``arg1_0_vec``, which is therefore scoped by the thread id.
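How such a conflict-free schedule can be derived is easiest to see on a toy
version of the colouring step. The sketch below is a naive greedy algorithm,
not PyOP2's actual planner from :ref:`plan-colouring`: it assigns cells that
share a vertex, and hence could race on indirectly accessed data, different
colours.

.. code-block:: python

  import numpy as np

  def greedy_colour(cell2vertex):
      """Colour cells so no two cells sharing a vertex get the same colour;
      used[v] records the colours already present at vertex v."""
      ncells = cell2vertex.shape[0]
      used = [set() for _ in range(int(cell2vertex.max()) + 1)]
      colours = np.empty(ncells, dtype=int)
      for c in range(ncells):
          taken = set().union(*(used[v] for v in cell2vertex[c]))
          colour = 0
          while colour in taken:
              colour += 1
          colours[c] = colour
          for v in cell2vertex[c]:
              used[v].add(colour)
      return colours

Cells of equal colour can safely be processed concurrently; grouping them
into blocks yields the ``blkmap``, ``offset`` and ``nelems`` arrays the
wrapper above receives as part of the execution plan.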
.. _device_backends:

Device backends
---------------

As with the host backends, the device backends have most of the
implementation in common. The PyOP2 data carriers :class:`~pyop2.Dat`,
:class:`~pyop2.Global` and :class:`~pyop2.Const` have a data array in host
memory and a separate array in device memory. Flags indicate the present
state of a given data carrier:

* ``DEVICE_UNALLOCATED``: no data is allocated on the device
* ``HOST_UNALLOCATED``: no data is allocated on the host
* ``DEVICE``: data is up-to-date (valid) on the device, but invalid on the
  host
* ``HOST``: data is up-to-date (valid) on the host, but invalid on the
  device
* ``BOTH``: data is up-to-date (valid) on both the host and device

When a :func:`~pyop2.par_loop` is called, PyOP2 uses the
:ref:`access-descriptors` to determine which data needs to be allocated or
transferred from host to device prior to launching the kernel. Data is only
transferred if it is out of date at the target location and all data transfer
is triggered lazily, i.e. the actual copy only occurs once the data is
requested. In particular there is no automatic transfer back of data from
device to host unless it is accessed on the host.

A newly created device :class:`~pyop2.Dat` has no associated device data and
starts out in the state ``DEVICE_UNALLOCATED``. The diagram below shows all
actions that involve a state transition, which can be divided into three
groups: calling explicit data transfer functions (red), accessing data on the
host (black) and using the :class:`~pyop2.Dat` in a :func:`~pyop2.par_loop`
(blue). There is no need for users to explicitly initiate data transfers and
the transfer functions are only given for completeness.

.. figure:: images/pyop2_device_data_state.svg
   :align: center

   State transitions of a data carrier on PyOP2 device backends

When a device :class:`~pyop2.Dat` is used in a :func:`~pyop2.par_loop` for
the first time, data is allocated on the device. If the :class:`~pyop2.Dat`
is only read, the host array is transferred to the device if it was in state
``HOST`` or ``DEVICE_UNALLOCATED`` before the :func:`~pyop2.par_loop` and the
:class:`~pyop2.Dat` is in the state ``BOTH`` afterwards, unless it was in
state ``DEVICE`` in which case it remains in that state. If the
:class:`~pyop2.Dat` is written to, data transfer before the
:func:`~pyop2.par_loop` is necessary unless the access descriptor is
:data:`~pyop2.WRITE`; the host data is out of date afterwards and the
:class:`~pyop2.Dat` is in the state ``DEVICE``. An overview of the state
transitions and necessary memory allocations and data transfers for the two
cases is given in the table below:

====================== ============================== ==================================================
Initial state          :func:`~pyop2.par_loop` read   :func:`~pyop2.par_loop` written to
====================== ============================== ==================================================
``DEVICE_UNALLOCATED`` ``BOTH`` (alloc, transfer h2d) ``DEVICE`` (alloc, transfer h2d unless write-only)
``DEVICE``             ``DEVICE``                     ``DEVICE``
``HOST``               ``BOTH`` (transfer h2d)        ``DEVICE`` (transfer h2d unless write-only)
``BOTH``               ``BOTH``                       ``DEVICE``
====================== ============================== ==================================================

Accessing data on the host initiates a device to host data transfer if the
:class:`~pyop2.Dat` is in state ``DEVICE`` and leaves it in state ``HOST``
when using the :meth:`~pyop2.Dat.data` property and ``BOTH`` when using
:meth:`~pyop2.Dat.data_ro`.

The state transitions described above apply in the same way to a
:class:`~pyop2.Global`. A :class:`~pyop2.Const` is read-only, never modified
on device and therefore never out of date on the host. Hence there is no
state ``DEVICE`` and it is not necessary to copy back :class:`~pyop2.Const`
data from device to host.
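The transition rules above can be condensed into a small sketch. This is not
PyOP2's actual implementation; the helpers ``device_alloc``, ``copy_h2d``
and ``copy_d2h`` are hypothetical stand-ins for the backend's transfer
primitives:

.. code-block:: python

  DEVICE_UNALLOCATED, HOST, DEVICE, BOTH = "unallocated", "host", "device", "both"

  class DeviceArray:
      """Sketch of the lazy host/device coherence protocol described above."""

      def __init__(self, host_array):
          self._host = host_array
          self._device = None
          self.state = DEVICE_UNALLOCATED

      def before_par_loop(self, write_only):
          if self._device is None:
              self._device = device_alloc(self._host.nbytes)  # hypothetical
          if not write_only and self.state in (DEVICE_UNALLOCATED, HOST):
              copy_h2d(self._device, self._host)               # hypothetical
              self.state = BOTH

      def after_par_loop(self, written):
          if written:
              self.state = DEVICE        # host copy is now stale
          elif self.state != DEVICE:
              self.state = BOTH

      @property
      def data(self):
          if self.state == DEVICE:
              copy_d2h(self._host, self._device)               # hypothetical
          self.state = HOST              # writable access invalidates device copy
          return self._host

      @property
      def data_ro(self):
          if self.state == DEVICE:
              copy_d2h(self._host, self._device)
              self.state = BOTH
          return self._host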
.. _cuda_backend:

CUDA backend
~~~~~~~~~~~~

The CUDA backend makes extensive use of PyCUDA_ and its infrastructure for
just-in-time compilation of CUDA kernels and interfacing them to Python.
Linear solvers and sparse matrix data structures are implemented on top of
the `CUSP library`_ and are described in greater detail in
:doc:`linear_algebra`. Code generation uses a template based approach, where
a ``__global__`` stub routine to be called from the host is generated, which
takes care of data marshalling and calling the user kernel as an inline
``__device__`` function.

We consider the same ``midpoint`` kernel as in the previous examples, which
requires no CUDA-specific modifications and is automatically annotated with a
``__device__`` qualifier. PyCUDA_ automatically generates a host stub for the
generated kernel stub ``__midpoint_stub`` given a list of parameter types. It
takes care of translating Python objects to plain C data types and pointers,
such that a CUDA kernel can be launched straight from Python. The entire CUDA
code PyOP2 generates is as follows: ::

  __device__ void midpoint(double p[2], double *coords[2])
  {
    p[0] = ((coords[0][0] + coords[1][0]) + coords[2][0]) / 3.0;
    p[1] = ((coords[0][1] + coords[1][1]) + coords[2][1]) / 3.0;
  }

  __global__ void __midpoint_stub(int set_size, int set_offset,
      double *arg0,
      double *ind_arg1,
      int *ind_map,
      short *loc_map,
      int *ind_sizes,
      int *ind_offs,
      int block_offset,
      int *blkmap,
      int *offset,
      int *nelems,
      int *nthrcol,
      int *thrcol,
      int nblocks) {
    extern __shared__ char shared[];
    __shared__ int *ind_arg1_map;
    __shared__ int ind_arg1_size;
    __shared__ double * ind_arg1_shared;
    __shared__ int nelem, offset_b, offset_b_abs;

    double *ind_arg1_vec[3];

    if (blockIdx.x + blockIdx.y * gridDim.x >= nblocks) return;
    if (threadIdx.x == 0) {
      int blockId = blkmap[blockIdx.x + blockIdx.y * gridDim.x + block_offset];
      nelem = nelems[blockId];
      offset_b_abs = offset[blockId];
      offset_b = offset_b_abs - set_offset;

      ind_arg1_size = ind_sizes[0 + blockId * 1];
      ind_arg1_map = &ind_map[0 * set_size] + ind_offs[0 + blockId * 1];

      int nbytes = 0;
      ind_arg1_shared = (double *) &shared[nbytes];
    }
    __syncthreads();

    // Copy into shared memory
    for ( int idx = threadIdx.x; idx < ind_arg1_size * 2; idx += blockDim.x ) {
      ind_arg1_shared[idx] = ind_arg1[idx % 2 + ind_arg1_map[idx / 2] * 2];
    }
    __syncthreads();

    // process set elements
    for ( int idx = threadIdx.x; idx < nelem; idx += blockDim.x ) {
      ind_arg1_vec[0] = ind_arg1_shared + loc_map[0*set_size + idx + offset_b]*2;
      ind_arg1_vec[1] = ind_arg1_shared + loc_map[1*set_size + idx + offset_b]*2;
      ind_arg1_vec[2] = ind_arg1_shared + loc_map[2*set_size + idx + offset_b]*2;
      midpoint(arg0 + 2 * (idx + offset_b_abs), ind_arg1_vec);
    }
  }

The CUDA kernel ``__midpoint_stub`` is launched on the GPU for a specific
number of threads in parallel. Each thread is identified inside the kernel by
its thread id ``threadIdx`` within a block of threads identified by a two
dimensional block id ``blockIdx`` within a grid of blocks.

As for OpenMP, there is the potential for data races, which are prevented by
colouring the iteration set and computing a parallel execution plan, where
all elements of the same colour can be modified simultaneously. Each colour
is computed by a block of threads in parallel.

All threads of a thread block have access to a shared memory, which is used
as a shared staging area initialised by thread 0 of each block, see lines
30-41 above. A call to ``__syncthreads()`` ensures these initial values are
visible to all threads of the block. After this barrier, all threads
cooperatively gather data from the indirectly accessed :class:`~pyop2.Dat`
via the :class:`~pyop2.Map`, followed by another synchronisation. Following
that, each thread loops over the elements in the partition with an increment
of the block size. In each iteration a thread-private array of pointers to
coordinate data in shared memory is built which is then passed to the
``midpoint`` kernel. As for other backends, the first, directly accessed,
argument is passed as a pointer to global device memory with a suitable
offset.
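The PyCUDA_ mechanism PyOP2 relies on, compiling a CUDA source string at
runtime and launching the resulting kernel from Python, can be seen in
isolation in the following self-contained sketch (a toy kernel of our own,
not generated code):

.. code-block:: python

  import numpy as np
  import pycuda.autoinit              # create a context on the default GPU
  import pycuda.driver as cuda
  from pycuda.compiler import SourceModule

  # nvcc is invoked behind the scenes; PyCUDA caches the compiled binary
  mod = SourceModule("""
  __global__ void double_it(double *a, int n)
  {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0;
  }
  """)
  double_it = mod.get_function("double_it")

  a = np.arange(8, dtype=np.float64)
  a_gpu = cuda.to_device(a)           # allocate and copy host -> device
  double_it(a_gpu, np.int32(a.size), block=(8, 1, 1), grid=(1, 1))
  print(cuda.from_device(a_gpu, a.shape, a.dtype))   # [ 0.  2.  4. ... 14.]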
.. _opencl_backend:

OpenCL backend
~~~~~~~~~~~~~~

The other device backend, OpenCL, is structurally very similar to the CUDA
backend. It uses PyOpenCL_ to interface to the OpenCL drivers and runtime.
Linear algebra operations are handled by PETSc_ as described in
:doc:`linear_algebra`. PyOP2 generates a kernel stub from a template similar
to the CUDA case.

Consider the ``midpoint`` kernel from previous examples, whose parameters in
the kernel signature are automatically annotated with OpenCL storage
qualifiers. PyOpenCL_ provides Python wrappers for OpenCL runtime functions
to build a kernel from a code string, set its arguments and enqueue the
kernel for execution. It takes care of the necessary conversion from Python
objects to plain C data types. PyOP2 generates the following code for the
``midpoint`` example: ::

  #define ROUND_UP(bytes) (((bytes) + 15) & ~15)

  void midpoint(__global double p[2], __local double *coords[2]);
  void midpoint(__global double p[2], __local double *coords[2])
  {
    p[0] = ((coords[0][0] + coords[1][0]) + coords[2][0]) / 3.0;
    p[1] = ((coords[0][1] + coords[1][1]) + coords[2][1]) / 3.0;
  }

  __kernel __attribute__((reqd_work_group_size(668, 1, 1)))
  void __midpoint_stub(
      __global double* arg0,
      __global double* ind_arg1,
      int set_size,
      int set_offset,
      __global int* p_ind_map,
      __global short *p_loc_map,
      __global int* p_ind_sizes,
      __global int* p_ind_offsets,
      __global int* p_blk_map,
      __global int* p_offset,
      __global int* p_nelems,
      __global int* p_nthrcol,
      __global int* p_thrcol,
      __private int block_offset) {
    __local char shared [64] __attribute__((aligned(sizeof(long))));
    __local int offset_b;
    __local int offset_b_abs;
    __local int active_threads_count;

    int nbytes;
    int block_id;

    int i_1;
    // shared indirection mappings
    __global int* __local ind_arg1_map;
    __local int ind_arg1_size;
    __local double* __local ind_arg1_shared;
    __local double* ind_arg1_vec[3];

    if (get_local_id(0) == 0) {
      block_id = p_blk_map[get_group_id(0) + block_offset];
      active_threads_count = p_nelems[block_id];
      offset_b_abs = p_offset[block_id];
      offset_b = offset_b_abs - set_offset;

      ind_arg1_size = p_ind_sizes[0 + block_id * 1];
      ind_arg1_map = &p_ind_map[0 * set_size] + p_ind_offsets[0 + block_id * 1];

      nbytes = 0;
      ind_arg1_shared = (__local double*) (&shared[nbytes]);
      nbytes += ROUND_UP(ind_arg1_size * 2 * sizeof(double));
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // staging in of indirect dats
    for (i_1 = get_local_id(0); i_1 < ind_arg1_size * 2; i_1 += get_local_size(0)) {
      ind_arg1_shared[i_1] = ind_arg1[i_1 % 2 + ind_arg1_map[i_1 / 2] * 2];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (i_1 = get_local_id(0); i_1 < active_threads_count; i_1 += get_local_size(0)) {
      ind_arg1_vec[0] = ind_arg1_shared + p_loc_map[i_1 + 0*set_size + offset_b] * 2;
      ind_arg1_vec[1] = ind_arg1_shared + p_loc_map[i_1 + 1*set_size + offset_b] * 2;
      ind_arg1_vec[2] = ind_arg1_shared + p_loc_map[i_1 + 2*set_size + offset_b] * 2;

      midpoint((__global double* __private)(arg0 + (i_1 + offset_b_abs) * 2), ind_arg1_vec);
    }
  }

Parallel computations in OpenCL are executed by *work items* organised into
*work groups*. OpenCL requires the annotation of all pointer arguments with
the memory region they point to: ``__global`` memory is visible to any work
item, ``__local`` memory to any work item within the same work group and
``__private`` memory is private to a work item. PyOP2 does this annotation
automatically for the user kernel if the OpenCL backend is used. Local memory
therefore corresponds to CUDA's shared memory and private memory is called
local memory in CUDA. The work item id within the work group is accessed via
the OpenCL runtime call ``get_local_id(0)``, the work group id via
``get_group_id(0)``. A barrier synchronisation across all work items of a
work group is enforced with a call to ``barrier(CLK_LOCAL_MEM_FENCE)``.
Bearing these differences in mind, the OpenCL kernel stub is structurally
almost identical to the corresponding CUDA version above.

The required work group size is computed as part of the execution plan. In
CUDA this value is a launch parameter to the kernel, whereas in OpenCL it
needs to be hard coded as the kernel attribute ``reqd_work_group_size``;
likewise, the required local memory size per work group is hard coded as the
size of the ``shared`` array.

.. _PyCUDA: http://mathema.tician.de/software/pycuda/
.. _CUSP library: http://cusplibrary.github.io
.. _PyOpenCL: http://mathema.tician.de/software/pyopencl/
.. _PETSc: http://www.mcs.anl.gov/petsc/petsc-as/
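Analogously to the PyCUDA example above, the build-set-arguments-enqueue
cycle provided by PyOpenCL_ looks as follows in isolation (again a toy kernel
of our own, assuming a device with double precision support):

.. code-block:: python

  import numpy as np
  import pyopencl as cl

  ctx = cl.create_some_context()      # pick an OpenCL platform/device
  queue = cl.CommandQueue(ctx)

  prg = cl.Program(ctx, """
  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
  __kernel void double_it(__global double *a, int n)
  {
    int i = get_global_id(0);
    if (i < n) a[i] *= 2.0;
  }
  """).build()

  a = np.arange(8, dtype=np.float64)
  mf = cl.mem_flags
  a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)
  prg.double_it(queue, a.shape, None, a_buf, np.int32(a.size))
  cl.enqueue_copy(queue, a, a_buf)    # device -> host
  print(a)                            # [ 0.  2.  4. ... 14.]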