COFFEE’s autotuning system.
Bases: object
Initialize the autotuner.
Parameters: 


Return the fastest kernel implementation.
Parameters:  resolution – the amount of time in milliseconds a kernel is run. 

This file contains the hierarchy of classes that implement a kernel’s Abstract Syntax Tree (ast).
Bases: object
Dummy mixin class used to decorate classes which can form part of a perfect loop nest.
Bases: object
The base class of the AST.
Bases: pyop2.coffee.ast_base.Node
Root of the AST.
Bases: pyop2.coffee.ast_base.Node
Generic expression.
Bases: pyop2.coffee.ast_base.Expr
Generic binary expression.
Bases: pyop2.coffee.ast_base.Expr
Generic unary expression.
Bases: pyop2.coffee.ast_base.UnaryExpr
Unary negation of an expression
Bases: pyop2.coffee.ast_base.Expr
Array Initilizer. A ndimensional array A can be statically initialized to some values. For example
A[3][3] = {{0.0}} or A[3] = {1, 1, 1}.
At the moment, initial values like {{0.0}} and {1, 1, 1} are passed in as simple strings.
Bases: pyop2.coffee.ast_base.ArrayInit
Array initilizer in which zerocolumns, i.e. columns full of zeros, are explictly tracked. Only bidimensional arrays are allowed.
Zero columns are tracked once the object is instantiated.
Parameters: 


Bases: pyop2.coffee.ast_base.UnaryExpr
Parenthesis object.
Bases: pyop2.coffee.ast_base.BinExpr
Binary sum.
Bases: pyop2.coffee.ast_base.BinExpr
Binary subtraction.
Bases: pyop2.coffee.ast_base.BinExpr
Binary product.
Bases: pyop2.coffee.ast_base.BinExpr
Binary division.
Bases: pyop2.coffee.ast_base.BinExpr
Compare two expressions using the operand <.
Bases: pyop2.coffee.ast_base.Expr, pyop2.coffee.ast_base.Perfect
Function call.
Bases: pyop2.coffee.ast_base.Expr
Ternary operator: expr ? true_stmt : false_stmt.
Bases: pyop2.coffee.ast_base.Expr
A generic symbol. The length of rank is the tensor rank:
Parameters:  rank (tuple) – entries represent the iteration variables the symbol depends on, or explicit numbers representing the entry of a tensor the symbol is accessing, or the size of the tensor itself. 

Bases: pyop2.coffee.ast_base.Sum
Sum of two vector registers using AVX intrinsics.
Bases: pyop2.coffee.ast_base.Sub
Subtraction of two vector registers using AVX intrinsics.
Bases: pyop2.coffee.ast_base.Prod
Product of two vector registers using AVX intrinsics.
Bases: pyop2.coffee.ast_base.Div
Division of two vector registers using AVX intrinsics.
Bases: pyop2.coffee.ast_base.Symbol
Load of values in a vector register using AVX intrinsics.
Bases: pyop2.coffee.ast_base.Symbol
Replicate the symbol’s value in all slots of a vector register using AVX intrinsics.
Bases: pyop2.coffee.ast_base.Node
Base class for commands productions.
Bases: pyop2.coffee.ast_base.Statement, pyop2.coffee.ast_base.Perfect
Empty statement.
Bases: pyop2.coffee.ast_base.Statement
Treat a chunk of code as a single statement, i.e. a C string
Bases: pyop2.coffee.ast_base.Statement, pyop2.coffee.ast_base.Perfect
Assign an expression to a symbol.
Bases: pyop2.coffee.ast_base.Statement, pyop2.coffee.ast_base.Perfect
Increment a symbol by an expression.
Bases: pyop2.coffee.ast_base.Statement, pyop2.coffee.ast_base.Perfect
Decrement a symbol by an expression.
Bases: pyop2.coffee.ast_base.Statement, pyop2.coffee.ast_base.Perfect
Inplace multiplication of a symbol by an expression.
Bases: pyop2.coffee.ast_base.Statement, pyop2.coffee.ast_base.Perfect
Inplace division of a symbol by an expression.
Bases: pyop2.coffee.ast_base.Statement, pyop2.coffee.ast_base.Perfect
Declaration of a symbol.
Syntax:
[qualifiers] typ sym [attributes] [= init];
E.g.:
static const double FE0[3][3] __attribute__(align(32)) = {{...}};
Return the size of the declared variable. In particular, return
If the declared array:
Then return a tuple of the first and last nonzero columns in the array. Else, return an empty tuple.
Bases: pyop2.coffee.ast_base.Statement
Block of statements.
Bases: pyop2.coffee.ast_base.Statement
Represent the classic for loop of an imperative language, although some restrictions must be considered: only a single iteration variable can be declared and modified (i.e. it is not supported something like
for (int i = 0, j = 0; ...)
Bases: pyop2.coffee.ast_base.Statement
Switch construct.
Parameters: 


Bases: pyop2.coffee.ast_base.Statement
Function declaration.
Syntax:
[pred] ret name ([args]) {body};
E.g.:
static inline void foo(int a, int b) {return;};
Bases: pyop2.coffee.ast_base.Assign
Store of values in a vector register using AVX intrinsics.
Bases: pyop2.coffee.ast_base.Statement
Permutation of values in a vector register using AVX intrinsics. The intrinsic function used is _mm256_permute_pd.
Bases: pyop2.coffee.ast_base.Statement
Permutation of values in two vector registers using AVX intrinsics. The intrinsic function used is _mm256_permute2f128_pd.
Bases: pyop2.coffee.ast_base.Statement
Unpack of values in a vector register using AVX intrinsics. The intrinsic function used is _mm256_unpackhi_pd.
Bases: pyop2.coffee.ast_base.Statement
Unpack of values in a vector register using AVX intrinsics. The intrinsic function used is _mm256_unpacklo_pd.
Bases: pyop2.coffee.ast_base.Statement
Set to 0 the entries of a vector register using AVX intrinsics.
Bases: pyop2.coffee.ast_base.Node
Represent directives which are handled by the C’s preprocessor.
Indent each row of the given string block with n*2 spaces.
Bases: object
Convert assembly code into sequences of calls to external dense linear algebra libraries. Currently, MKL, ATLAS, and EIGEN are supported.
Initialize an AssemblyLinearAlgebra object.
Parameters: 


Transform perfect loop nests representing matrixmatrix multiplies into calls to a dense linear algebra library.
Parameters:  library – the BLAS library that should be used (mkl, atlas, eigen). 

Bases: object
Assembly optimiser interface class
Initialize the AssemblyOptimizer.
Parameters: 


Remove fullyparallel loop from the iteration space. These are the loops that were marked by the user/higher layer with a pragma pyop2 itspace.
Rewrite an assembly expression to minimize floating point operations and relieve register pressure. This involves several possible transformations:
Parameters:  level – The optimization level (0, 1, 2, 3, 4). The higher, the more invasive is the rewriting of the assembly expressions, trying to eliminate unnecessary floating point operations.


Perform slicing of the innermost loop to enhance register reuse. For example, given a loop:
for i = 0 to N
f()
the following sequence of loops is generated:
for i = 0 to k
f()
for i = k to 2k
f()
# ...
for i = (N1)k to N
f()
The goal is to improve register reuse by relying on the backend compiler unrolling and vectorpromoting the sliced loops.
Unroll loops in the assembly nest.
Parameters:  loops_factor – dictionary from loops to unroll (factor, increment). Loops are specified as integers:
A factor of 0 denotes that the corresponding loop is not present. 

Permute the integration loop with the innermost loop in the assembly nest. This transformation is legal if _precompute was invoked. Storage layout of all 2dimensional arrays involved in the element matrix computation is transposed.
Split assembly expressions into multiple chunks exploiting sum’s associativity. Each chunk will have cut summands.
For example, consider the following piece of code:
for i
for j
A[i][j] += X[i]*Y[j] + Z[i]*K[j] + B[i]*X[j]
If cut=1 the expression is cut into chunks of length 1:
for i
for j
A[i][j] += X[i]*Y[j]
for i
for j
A[i][j] += Z[i]*K[j]
for i
for j
A[i][j] += B[i]*X[j]
If cut=2 the expression is cut into chunks of length 2, plus a remainder chunk of size 1:
for i
for j
A[i][j] += X[i]*Y[j] + Z[i]*K[j]
# Remainder:
for i
for j
A[i][j] += B[i]*X[j]
Bases: object
Provide operations to rewrite an assembly expression:
Initialize the AssemblyRewriter.
Parameters: 


Perform loopinvariant code motion.
Invariant expressions found in the loop nest are moved “after” the outermost independent loop and “after” the fastest varying dimension loop. Here, “after” means that if the loop nest has two loops i and j, and j is in the body of i, then i comes after j (i.e. the loop nest has to be read from right to left).
For example, if a subexpression E depends on [i, j] and the loop nest has three loops [i, j, k], then E is hoisted out from the body of k to the body of i). All hoisted expressions are then wrapped within a suitable loop in order to exploit compiler autovectorization. Note that this applies to constant subexpressions as well, in which case hoisting after the outermost loop takes place.
For each variable in the assembly expression, count how many times it appears as involved in some operations. For example, for the expression a*(5+c) + b*(a+4), return {a: 2, b: 1, c: 1}.
Expand assembly expressions such that:
Y[j] = f(...)
(X[i]*Y[j])*F + ...
becomes:
Y[j] = f(...)*F
(X[i]*Y[j]) + ...
This may be useful for several purposes:
Apply to the distributivity property to the assembly expression. E.g.
A[i]*B[j] + A[i]*C[j]
becomes
A[i]*(B[j] + C[j]).
Scan the hoisted terms one by one and eliminate duplicate subexpressions. Remove useless assignments (e.g. a = b, and b never used later).
Bases: object
Expand assembly expressions such that:
Y[j] = f(...)
(X[i]*Y[j])*F + ...
becomes:
Y[j] = f(...)*F
(X[i]*Y[j]) + ...
Perform the expansion of the expression rooted in node. Terms are expanded along the iteration variable exp_var.
Bases: object
Base class for classes that handle loop scheduling; that is, loop fusion, loop distribution, etc.
Initialize the LoopScheduler.
Parameters: 


Bases: pyop2.coffee.ast_optimizer.LoopScheduler
Analyze data dependencies and iteration spaces, then merge fusable loops. Statements must be in “soft” SSA form: they can be declared and initialized at declaration time, then they can be assigned a value in only one place.
Merge perfect loop nests rooted in self.root.
Bases: pyop2.coffee.ast_optimizer.LoopScheduler
Analyze data dependencies and iteration spaces, then fission associative operations in expressions. Fissioned expressions are placed in a separate loop nest.
Initialize the ExprLoopFissioner.
Parameters:  cut – number of operands requested to fission expressions. 

Split an expression containing x summands into x/cut chunks. Each chunk is placed in a separate loop nest if copy_loops is true, in the same loop nest otherwise. Return a dictionary of all of the split chunks, in which each entry has the same format of expr.
Parameters: 


Bases: pyop2.coffee.ast_optimizer.LoopScheduler
Analyze data dependencies, iteration spaces, and domainspecific information to perform symbolic execution of the assembly code so as to determine how to restructure the loop nests to skip iteration over zerovalued columns. This implies that loops can be fissioned or merged. For example:
for i = 0, N
A[i] = C[i]*D[i]
B[i] = E[i]*F[i]
If the evaluation of A requires iterating over a region of contiguous zerovalued columns in C and D, then A is computed in a separate (smaller) loop nest:
for i = 0 < (Nk)
A[i+k] = C[i+k][i+k]
for i = 0, N
B[i] = E[i]*F[i]
Initialize the ZeroLoopScheduler.
Parameters:  decls – lists of array declarations. A 2tuple is expected: the first element is the list of kernel declarations; the second element is the list of hoisted temporaries declarations. 

Restructure the loop nests rooted in self.root based on the propagation of zerovalued columns along the computation. This, therefore, involves fissing and fusing loops so as to remove iterations spent performing arithmetic operations over zerovalued entries. Return a list of dictionaries, a dictionary for each loop nest encountered. Each entry in a dictionary is of the form {stmt: (itvars, parent, loops)}, in which stmt is a statement found in the loop nest from which the dictionary derives, itvars is the tuple of the iteration variables of the enclosing loops, parent is the AST node in which the loop nest is rooted, loops is the tuple of loops composing the loop nest.
Bases: object
Track readafterwrite dependencies between symbols.
Extract symbols from expr and create a readafterwrite dependency with sym. If sym already has a dependency, then sym has a self dependency on itself.
If target_sym is not provided, return True if sym has a readafterwrite dependency with some other symbols. This is the case if sym has either a self dependency or at least one input edge, meaning that other symbols depend on it. Otherwise, if target_sym is not None, return True if sym has a readafterwrite dependency on it, i.e. if there is an edge from target_sym to sym.
Transform the kernel’s AST according to the backend we are running over.
Bases: object
Manipulate the kernel’s Abstract Syntax Tree.
The single functionality present at the moment is provided by the plan_gpu() method, which transforms the AST for GPU execution.
Transform the kernel suitably for GPU execution.
Loops decorated with a pragma pyop2 itspace are hoisted out of the kernel. The list of arguments in the function signature is enriched by adding iteration variables of hoisted loops. Size of kernel’s nonconstant tensors modified in hoisted loops are modified accordingly.
For example, consider the following function:
void foo (int A[3]) {
int B[3] = {...};
#pragma pyop2 itspace
for (int i = 0; i < 3; i++)
A[i] = B[i];
}
plan_gpu modifies its AST such that the resulting output code is
void foo(int A[1], int i) {
A[0] = B[i];
}
Transform and optimize the kernel suitably for CPU execution.
Generate a string representation of the AST.
Initialize COFFEE.
Utility functions for AST transformation.
“Increase the stack size it the total space occupied by the kernel’s local arrays is too big.
Return a list of unroll factors to run, given loop sizes in sizes. The return value is a list of tuples, where each element in a tuple represents the unroll factor for the corresponding loop in the nest.
For example, if there are three loops i, j, and k, a tuple (2, 1, 1) in the returned list indicates that the outermost loop i should be unrolled by a factor two (i.e. two iterations), while loops j and k should not be unrolled.
Parameters:  ths – unrolling threshold that cannot be exceed by the overall unroll factor 

Given a dictionary ofs s.t. {'itvar': ofs}, update the various iteration variables in the symbols rooted in node.
Given an itspace in the form
(('itvar', (bound_a, bound_b), ...)),
return
((('it_var', bound_b  bound_a), ...), (('it_var', bound_a), ...))
Given an iterator of iteration spaces, each iteration space represented as a 2tuple containing the start and end point, return a tuple of iteration spaces in which contiguous iteration spaces have been merged. For example:
[(1,3), (4,6)] > ((1,6),)
[(1,3), (5,6)] > ((1,3), (5,6))
Bases: object
Loop vectorizer
Align all data structures accessed in the loop nest to the size in bytes of the vector length.
Pad all data structures accessed in the loop nest to the nearest multiple of the vector length. Adjust trip counts and bounds of all innermost loops where padded arrays are written to. Since padding enforces data alignment of multidimensional arrays, add suitable pragmas to inner loops to inform the backend compiler about this property.
Compute outer products according to opts.
Generate outer product vectorisation of a statement.
Bases: object
Handle allocation of register variables.
Generate the outerproduct intrinsicsbased vectorisation code.
Return x rounded up to the vector length.
Return x rounded down to the vector length.
Find inner loops in the subtree rooted in node.