The PyOP2 Intermediate Representation¶
The parallel loop is the main construct of PyOP2. It applies a specific Kernel to all elements in the iteration set of the parallel loop. Here, we describe how to use the PyOP2 API to build a kernel, and we provide simple guidelines for writing efficient kernels.
Using the Intermediate Representation¶
In the previous section, we described the API for PyOP2 kernels in terms of the C code that gets executed. Passing in a string of C code is the simplest way of creating a Kernel. Another possibility is to use PyOP2 Intermediate Representation (IR) objects to express the Kernel semantics.
An Abstract Syntax Tree (AST) of the kernel code can be built manually using IR objects. Since PyOP2 is primarily intended to be fed by higher layers of abstraction, rather than by end users, no C-to-AST parser is currently provided. The advantage of providing an AST, instead of C code, is that it enables PyOP2 to inspect and transform the kernel, with the aim of achieving performance portability across different architectures and, more generally, better execution times.
For the purposes of exposition, let us consider a simple kernel init which initialises the members of a Dat to zero.
from op2 import Kernel
code = """void init(double* edge_weight) {
  for (int i = 0; i < 3; i++)
    edge_weight[i] = 0.0;
}"""
kernel = Kernel(code, "init")
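For completeness, here is a sketch of how such a kernel would then be applied to every element of an iteration set through a parallel loop. The declarations of edges and edge_weight below are hypothetical, and the exact constructor and access-descriptor syntax may differ between PyOP2 versions:
import numpy as np
from op2 import Set, DataSet, Dat, par_loop, WRITE
# Hypothetical data layout: 10 edges, each carrying 3 doubles.
edges = Set(10, "edges")
edge_weight = Dat(DataSet(edges, 3), np.zeros(30), np.float64, "edge_weight")
# Apply the init kernel (built above) to every element of the iteration set.
par_loop(kernel, edges, edge_weight(WRITE))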
Here, we describe how we can use PyOP2 IR objects to build an AST for this kernel. For example, the most basic AST one can come up with is
from op2 import Kernel
from ir.ast_base import *
ast = FlatBlock("""void init(double* edge_weight) {
  for (int i = 0; i < 3; i++)
    edge_weight[i] = 0.0;
}""")
kernel = Kernel(ast, "init")
The FlatBlock object encapsulates a “flat” block of code, which is not modified by the IR engine. A FlatBlock is used to represent (possibly large) fragments of code for which we are not interested in any kind of transformation, so it may be particularly useful to speed up code development when writing, for example, test cases or inexpensive kernels. On the other hand, time-demanding kernels should be properly represented using a “real” AST. For example, a useful AST for init could be the following
from op2 import Kernel
from ir.ast_base import *
ast_body = [FlatBlock("...some code can go here..."),
            c_for("i", 3, Assign(Symbol("edge_weight", ("i",)), c_sym("0.0")))]
ast = FunDecl("void", "init",
              [Decl("double*", c_sym("edge_weight"))],
              ast_body)
kernel = Kernel(ast, "init")
In this example, we first construct the body of the kernel function. We have an initial FlatBlock that contains, for instance, some sort of initialization code. c_for() is a shortcut for building a for loop. It takes an iteration variable (i), the extent of the loop and its body. Multiple statements in the body can be passed in as a list. c_sym() is a shortcut for building symbols. You may want to use c_sym() when the symbol makes no explicit use of iteration variables.
We use Symbol, instead of c_sym(), when edge_weight accesses a specific element using the iteration variable i. This is fundamental to allow the IR engine to perform many kinds of transformations involving the kernel’s iteration space(s). Finally, the signature of the function is constructed using FunDecl.
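To make the distinction concrete, here is a small sketch (the symbol names are purely illustrative):
from ir.ast_base import *
# A plain symbol, not indexed by any iteration variable: c_sym() suffices.
alpha = c_sym("alpha")
# A symbol accessed through the iteration variables i and j: Symbol is
# used, so that the IR engine can reason about the access.
entry = Symbol("A", ("i", "j"))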
Other examples of how to build ASTs can be found in the tests folder, in particular in test_matrices.py and test_iteration_space_dats.py.
Achieving Performance Portability with the IR¶
One of the key objectives of PyOP2 is obtaining performance portability. This means that exactly the same program can be executed on a range of different platforms, and that the PyOP2 engine will strive to get the best performance out of the chosen platform. PyOP2 allows users to write kernels by completely abstracting from the underlying machine. This is mainly achieved in two steps:
Given the AST of a kernel, PyOP2 applies a first transformation aimed at mapping the parallelism inherent to the kernel to that available in the backend.
Then, PyOP2 applies optimizations to the sequential code, depending on the underlying backend.
To maximize the outcome of the transformation process, it is important that kernels are written as simply as possible. That is, premature optimization, possibly for a specific backend, might harm performance.
A minimal language, the so-called PyOP2 Kernel Domain-Specific Language, is used to trigger specific transformations. If we had a parser from C code to AST, we would embed this DSL in C by means of pragmas. As we directly build an AST, we achieve the same goal by decorating AST nodes with specific attributes, added at node creation time. An overview of the language follows:
pragma pyop2 itspace: added to For nodes (i.e. written on top of for loops). It tells PyOP2 that the following is a fully-parallel loop, that is, all of its iterations can be executed in parallel without any sort of synchronization.
pragma pyop2 assembly(itvar1, itvar2): added to a statement node, to denote that we are performing a local assembly operation along the itvar1 and itvar2 dimensions.
pragma pyop2 simd: added on top of the kernel signature. It is used to suggest PyOP2 to apply SIMD vectorization along the ParLoop’s iteration set dimension. This kind of vectorization is also known as inter-kernel vectorization. This feature is currently not supported by PyOP2, and will be added only in a future release.
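Purely as an illustration of where each pragma is placed (and bearing in mind that pragma pyop2 simd is currently unsupported), a hypothetical assembly kernel decorated with all three pragmas might look as follows:
from op2 import Kernel
# Hypothetical kernel: the pragmas mark the kernel signature, the
# fully-parallel loops and the assembly statement, respectively.
code = """#pragma pyop2 simd
void mass(double A[3][3], double* X, double* Y) {
  #pragma pyop2 itspace
  for (int i = 0; i < 3; i++)
    #pragma pyop2 itspace
    for (int j = 0; j < 3; j++)
      #pragma pyop2 assembly(i, j)
      A[i][j] += X[i]*Y[j];
}"""
kernel = Kernel(code, "mass")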
The itspace pragma tells PyOP2 how to extract parallelism from the kernel. Consider again our usual example. To expose a parallel iteration space, one must write
from op2 import Kernel
code = """void init(double* edge_weight) {
  #pragma pyop2 itspace
  for (int i = 0; i < 3; i++)
    edge_weight[i] = 0.0;
}"""
kernel = Kernel(code, "init")
The c_for() shortcut used when creating an AST expresses the same semantics as a for loop decorated with a pragma pyop2 itspace.
Now, imagine we are executing the init kernel on a CPU architecture. Typically, we want a single core to execute the entire kernel, because it is very likely that the kernel’s iteration space is small and its working set fits in the L1 cache, so no benefit would be gained by splitting the computation between distinct cores. On the other hand, if the backend is a GPU or an accelerator, a different execution model might give better performance. There is a huge amount of parallelism available, for example, in a GPU, so delegating the execution of an individual iteration (or a chunk of iterations) to a single thread could pay off. If that is the case, the PyOP2 IR engine restructures the kernel code to exploit such parallelism.
Optimizing kernels on CPUs¶
So far, some effort has been spent on optimizations for CPU platforms. Being a DSL, PyOP2 provides specific support for the (linear algebra) operations that are common in unstructured-mesh-based numerical methods. For example, PyOP2 is capable of aggressively optimizing local assembly codes for applications based on the Finite Element Method. We therefore distinguish two categories of optimizations:
Generic optimizations, such as data alignment and support for autovectorization.
Domain-specific optimizations (DSO).
To trigger DSOs, statements must be decorated using the kernel DSL. For example, if the kernel computes the local assembly of an element in an unstructured mesh, then a pragma pyop2 assembly(itvar1, itvar2) should be added on top of the corresponding statement. When constructing the AST of a kernel, this can simply be achieved by
from ir.ast_base import *
s1 = Symbol("X", ("i",))
s2 = Symbol("Y", ("j",))
tensor = Symbol("A", ("i", "j"))
pragma = "#pragma pyop2 assembly(i, j)"
code = c_for("i", 3, c_for("j", 3, Incr(tensor, Prod(s1, s2), pragma)))
That, conceptually, corresponds to
#pragma pyop2 itspace
for (int i = 0; i < 3; i++)
  #pragma pyop2 itspace
  for (int j = 0; j < 3; j++)
    #pragma pyop2 assembly(i, j)
    A[i][j] += X[i]*Y[j];
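To turn this loop nest into a complete kernel, we can wrap it in a function declaration, as we did for init. The sketch below is illustrative: the function name and argument declarations are assumptions.
from op2 import Kernel
from ir.ast_base import *
# Wrap the decorated loop nest (the code object built above) in a
# function signature; the declarations here are illustrative.
ast = FunDecl("void", "assembly",
              [Decl("double**", c_sym("A")),
               Decl("double*", c_sym("X")),
               Decl("double*", c_sym("Y"))],
              [code])
kernel = Kernel(ast, "assembly")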
Visiting the AST, PyOP2 finds a 2-dimensional iteration space and an assembly statement. Currently, #pragma pyop2 itspace is ignored when the backend is a CPU, while #pragma pyop2 assembly(i, j) can trigger multiple DSOs.
PyOP2 currently lacks an autotuning system that automatically finds the best possible kernel implementation, that is, the set of optimizations that minimizes the kernel run-time. To drive the optimization process, the user (or the higher layer) can specify which optimizations should be applied. Currently, PyOP2 can automate:
Alignment and padding of data structures: for issuing aligned loads and stores.
Loop trip count adjustment according to padding: useful for autovectorization when the trip count is not a multiple of the vector length.
Loop-invariant code motion and autovectorization of invariant code: particularly useful since trip counts are typically small, and hoisted code can still represent a significant proportion of the execution time.
Register tiling for rectangular iteration spaces.
(DSO for pragma assembly) Outer-product vectorization + unroll-and-jam of outer loops, to improve register re-use or to mitigate register pressure.
How to select specific kernel optimizations¶
When constructing a Kernel, it is possible to specify the set of optimizations we want PyOP2 to apply. The IR engine will analyse the kernel AST and try to apply, incrementally, such optimizations. PyOP2’s FFC interface, which builds a Kernel object given an AST provided by FFC, already makes use of the available optimizations. Here, we take the emblematic case of the FFC interface and describe how to play with the various optimizations through a series of examples.
ast = ...
opts = {'licm': False,
        'tile': None,
        'ap': False,
        'vect': None}
kernel = Kernel(ast, 'my_kernel', opts)
In this example, we have an AST ast and we specify optimizations through the dictionary opts; then, we build the Kernel, passing in the optional argument opts. No optimizations are enabled here. The possible options are:
licm: Loop-Invariant Code Motion.
tile: Register Tiling (of rectangular iteration spaces).
ap: Data alignment, padding. Trip count adjustment.
vect: SIMD intra-kernel vectorization.
If we wanted to apply both loop-invariant code motion and data alignment, we would simply write
ast = ...
opts = {'licm': True,
        'ap': True}
kernel = Kernel(ast, 'my_kernel', opts)
Now, let’s assume we know the kernel has a rectangular iteration space, and we want to try register tiling with a particular tile size. The way to achieve this is
ast = ...
opts = {'tile': (True, 8)}
kernel = Kernel(ast, 'my_kernel', opts)
In this case, the iteration space is sliced into tiles of size 8x8. If the iteration space is smaller than the tile, then the transformation is not applied. By specifying -1 instead of 8, we leave PyOP2 free to choose a tile size automatically.
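For instance, a minimal variant that lets PyOP2 pick the tile size is
from op2 import Kernel
ast = ...
opts = {'tile': (True, -1)}  # -1 leaves the choice of tile size to PyOP2
kernel = Kernel(ast, 'my_kernel', opts)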
A fundamental optimization for any PyOP2 kernel is SIMD vectorization, because kernels almost always fit in the L1 cache and are likely to be compute-bound. The backend compiler’s autovectorization (AV) is therefore an opportunity. By enforcing data alignment and padding, we can increase the chance that AV is successful. To try AV, one should write
import ir.ast_plan as ap
ast = ...
opts = {'ap': True,
        'vect': (ap.AUTOVECT, -1)}
kernel = Kernel(ast, 'my_kernel', opts)
The second parameter of vect (-1) is ignored when AV is requested.
If our kernel computes an assembly-like operation, then we can ask PyOP2 to optimize for register locality and register pressure by resorting to a different vectorization technique. Early experiments show that this approach can be particularly useful when the amount of data movement in the assembly loops is “significant”. Of course, this depends on kernel parameters (e.g. size of the assembly loop, number and size of the arrays involved in the assembly) as well as on architecture parameters (e.g. size of the L1 cache, number of available registers). This strategy is called Outer-Product Vectorization (OP), and can be activated in the following way (again, we suggest using it along with data alignment and padding).
import ir.ast_plan as ap
ast = ...
opts = {'ap': True,
        'vect': (ap.V_OP_UAJ, 1)}
kernel = Kernel(ast, 'my_kernel', opts)
UAJ in V_OP_UAJ stands for Unroll-and-Jam. It has been shown that OP performs much better when used in combination with unrolling the outer assembly loop and incorporating (jamming) the unrolled iterations within the inner loop. The second parameter therefore specifies the unroll-and-jam factor: the higher it is, the more iterations are unrolled. A factor of 1 means that no unroll-and-jam is performed. The optimal factor depends strongly on the computational characteristics of the kernel.
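In practice, one might therefore build and benchmark a few variants with different factors. A minimal sketch, assuming a timing harness is available elsewhere:
from op2 import Kernel
import ir.ast_plan as ap
ast = ...
# Try a few unroll-and-jam factors; the best one is kernel-dependent.
for uaj_factor in (1, 2, 4):
    opts = {'ap': True, 'vect': (ap.V_OP_UAJ, uaj_factor)}
    kernel = Kernel(ast, 'my_kernel', opts)
    # ... execute and time the kernel, keeping the fastest variant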