Kernels in PyOP2 define the local operations that are to be performed for each
element of the iteration set the kernel is executed over. There must be a
one-to-one match between the arguments declared in the kernel signature and the
actual arguments passed to the parallel loop executing this kernel. As
described in *PyOP2 Concepts*, data is accessed directly on the iteration set
or via mappings passed in the `par_loop()` call.

The kernel only sees data corresponding to the current element of the
iteration set it is invoked for. Any data read by the kernel, i.e. accessed as
`READ`, `RW` or `INC`, is automatically gathered via the mapping relationship
in the *staging in* phase and the kernel is passed pointers to the staging
memory. Similarly, after the kernel has been invoked, any modified data, i.e.
accessed as `WRITE`, `RW` or `INC`, is scattered back out via the
`Map` in the *staging out* phase. It is only safe for a kernel
to manipulate data in the way declared via the access descriptor in the
parallel loop call. Any modification to an argument accessed read-only is not
written back, since the staging out phase is skipped for this argument.
Similarly, the result of reading an argument declared as write-only is
undefined, since the data has not been staged in.

Consider a `par_loop()` computing the midpoint of a triangle given
the three vertex coordinates. Note that we make use of a convenience in the
PyOP2 syntax, which allows declaring an anonymous `DataSet` of a
dimension greater than one using the `**` operator. We omit the actual data
in the declarations of the `Map` `cell2vertex` and the
`Dat` `coordinates`.

```
vertices = op2.Set(num_vertices)
cells = op2.Set(num_cells)
cell2vertex = op2.Map(cells, vertices, 3, [...])
coordinates = op2.Dat(vertices ** 2, [...], dtype=float)
midpoints = op2.Dat(cells ** 2, dtype=float)
op2.par_loop(midpoint, cells,
             midpoints(op2.WRITE),
             coordinates(op2.READ, cell2vertex))
```

Kernels are implemented in a restricted subset of C99 and are declared by
passing a *C code string* and the *kernel function name*, which must match the
name in the C kernel signature, to the `Kernel` constructor:

```
midpoint = op2.Kernel("""
void midpoint(double p[2], double *coords[2]) {
  p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0;
  p[1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0;
}""", "midpoint")
```

Since kernels cannot return any value, the return type is always `void`. The
kernel argument `p` corresponds to the third `par_loop()`
argument `midpoints` and `coords` to the fourth argument `coordinates`
respectively. Argument names need not agree; the matching is done by position.

Data types of kernel arguments must match the type of data passed to the
parallel loop. The Python types `float` and `numpy.float64`
correspond to a C `double`, `numpy.float32` to a C
`float`, `int` or `numpy.int64` to a C `long` and
`numpy.int32` to a C `int`.
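
For example, a minimal sketch of a matching single-precision declaration,
where the `Dat` `pressure` and the kernel `scale` are hypothetical names not
taken from the example above:

```
import numpy as np

# A Dat declared with numpy.float32 data must be matched by a C float
# in the kernel signature (pressure and scale are hypothetical names).
pressure = op2.Dat(vertices ** 1, dtype=np.float32)
scale = op2.Kernel("""
void scale(float *p) {
  p[0] *= 2.0f;
}""", "scale")
op2.par_loop(scale, vertices, pressure(op2.RW))
```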

Direct `par_loop()` arguments such as `midpoints` are passed to
the kernel as a `double *`, indirect arguments such as `coordinates` as a
`double **`, with the first indirection due to the map and the second
indirection due to the data dimension. The kernel signature above uses arrays
with explicit sizes to draw attention to the fact that these are known. We
could have interchangeably used a kernel signature with plain pointers:

```
void midpoint(double * p, double ** coords)
```

Argument creation supports an optional flag `flatten`, which is used for
kernels that expect the data to be laid out by component:

```
midpoint = op2.Kernel("""
void midpoint(double p[2], double *coords[1]) {
  p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0;
  p[1] = (coords[3][0] + coords[4][0] + coords[5][0]) / 3.0;
}""", "midpoint")
op2.par_loop(midpoint, cells,
             midpoints(op2.WRITE),
             coordinates(op2.READ, cell2vertex, flatten=True))
```

Data for a `Dat` declared on a `Set` is
stored contiguously for all elements of the set. For each element,
this is a contiguous chunk of data of a shape given by the
`DataSet` `dim` and the datatype of the
`Dat`. The size of this chunk is the product of the
extents of the `dim` tuple times the size of the datatype.
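
As a concrete check of this layout, consider the `coordinates`
`Dat` from the example above, with `dim` `(2,)` and a `double`
datatype; a minimal sketch using plain NumPy:

```
import numpy as np

# Layout equivalent to the coordinates Dat above: one contiguous
# chunk of 2 doubles (16 bytes) per set element.
data = np.zeros((num_vertices, 2), dtype=np.float64)
assert data.strides[0] == 2 * data.itemsize  # 16-byte chunk per element
```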

During execution of the `par_loop()`, the kernel is called
for each element of the iteration set and passed data for each of its
arguments corresponding to the current set element `i` only.

For a directly accessed argument such as `midpoints` above, the
kernel is passed a pointer to the beginning of the chunk of data for
the element `i` the kernel is currently called for. In CUDA/OpenCL
`i` is the global thread id since the kernel is launched in parallel
for all elements.
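
On the sequential backend this behaves schematically like the following
sketch, where `midpoints_data` is a hypothetical name for the underlying
array of `midpoints`:

```
/* Sketch: a direct argument is passed as a pointer to the chunk of
   element i, i.e. an offset of i * dim into the underlying array. */
for (int i = 0; i < nele; ++i)
  midpoint(midpoints_data + i * 2, ...);
```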

For an indirectly accessed argument such as `coordinates` above,
PyOP2 gathers pointers to the data via the `Map`
`cell2vertex` used for the indirection. The kernel is passed a list
of pointers of length corresponding to the *arity* of the
`Map`, in the example above 3. Each of these points to
the data chunk for the element in the target `Set` given
by `Map` entries `(i, 0)`, `(i, 1)` and `(i, 2)`.
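
In the same schematic style, with `coordinates_data` and `cell2vertex_map`
as hypothetical names for the underlying arrays:

```
/* Sketch: staging in an indirect argument of arity 3. Each pointer
   refers to a chunk of size dim = 2 in the data on the target Set. */
double *coords[3];
for (int i = 0; i < nele; ++i) {
  for (int m = 0; m < 3; ++m)
    coords[m] = coordinates_data + cell2vertex_map[i * 3 + m] * 2;
  midpoint(midpoints_data + i * 2, coords);
}
```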

If the argument is created with the keyword argument `flatten` set
to `True`, a flattened vector of pointers is passed to the kernel.
This vector is of length `dim * arity` (where `dim` is the product
of the extents of the `dim` tuple), which is 6 in the example above.
Each entry points to a single data value of the `Dat`.
The ordering is by component of `dim`, i.e. the first component of
each data item for each element in the target set pointed to by the
map, followed by the second component etc.
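
Continuing the sketch above, the flattened vector for set element `i` would
be assembled as follows, again with hypothetical array names:

```
/* Sketch: a flattened argument of dim * arity = 2 * 3 = 6 pointers,
   each to a single value: x of all three vertices, then y. */
double *flat[6];
for (int m = 0; m < 3; ++m) {
  flat[m]     = coordinates_data + cell2vertex_map[i * 3 + m] * 2;     /* x */
  flat[3 + m] = coordinates_data + cell2vertex_map[i * 3 + m] * 2 + 1; /* y */
}
```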

PyOP2 supports complex kernels with large local working set sizes, which may
not run very efficiently on architectures with a limited amount of registers
and on-chip resources. In many cases the resource usage is proportional to the
size of the *local iteration space* the kernel operates on.

Consider a finite-element local assembly kernel for vector-valued basis functions of second order on triangles. More complex kernels computing considerably larger local tensors are commonly found in finite-element computations, in particular for higher-order basis functions; this kernel serves only to illustrate the concept. For each element in the iteration set, it computes a 12x12 local tensor:

```
void kernel(double A[12][12], ...) {
  ...
  // loops over the local iteration space
  for (int j = 0; j < 12; j++) {
    for (int k = 0; k < 12; k++) {
      A[j][k] += ...
    }
  }
}
```

PyOP2 invokes this kernel for each element in the iteration set:

```
for (int ele = 0; ele < nele; ++ele) {
  double A[12][12];
  ...
  kernel(A, ...);
}
```

To execute complex kernels more efficiently on manycore platforms, their operation can be distributed among several threads, each of which computes a single point of this local iteration space. This increases the level of parallelism and lowers the amount of resources required per thread. For the kernel above we obtain:

```
void kernel(double A[1][1], ..., int j, int k) {
  ...
  A[0][0] += ...
}
```

Note how the doubly nested loop over basis functions is hoisted out of the
kernel, which receives its position in the local iteration space to compute as
additional arguments `j` and `k`. PyOP2 then calls the kernel for
each element of the local iteration space for each set element:

```
for (int ele = 0; ele < nele; ++ele) {
  double A[1][1];
  ...
  for (int j = 0; j < 12; j++) {
    for (int k = 0; k < 12; k++) {
      kernel(A, ..., j, k);
    }
  }
}
```

On manycore platforms, the local iteration space does not translate into a loop nest; instead, a larger number of threads is launched, each computing one of its elements.
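
Schematically, each thread derives its point in the combined iteration space
from its global thread id; a sketch, not actual generated code:

```
/* Sketch: one thread per (ele, j, k) point of the 12 x 12 local
   iteration space; tid is the global thread id of this thread. */
int ele = tid / (12 * 12);
int j   = (tid / 12) % 12;
int k   = tid % 12;
double A[1][1] = {{0.0}};
kernel(A, ..., j, k);
```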

PyOP2 needs to be told to loop over this local iteration space by
indexing the corresponding maps with an
`IterationIndex` `i` in the
`par_loop()` call.
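
A sketch of such a call for the assembly kernel above, where the
`Kernel` `assembly_kernel`, the `Set` `elements`, the `Mat` `mat` and the
`Map` `elem_node` are hypothetical names, and `op2.i[0]` and `op2.i[1]`
index the two dimensions of the local iteration space:

```
op2.par_loop(assembly_kernel, elements,
             mat(op2.INC, (elem_node[op2.i[0]], elem_node[op2.i[1]])),
             coordinates(op2.READ, elem_node))
```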