.. _linear_algebra:

PyOP2 Linear Algebra Interface
==============================

PyOP2 supports linear algebra operations on sparse matrices using a thin
wrapper around the PETSc_ library harnessed via its petsc4py_ interface.

As described in :doc:`concepts`, a sparse matrix is a linear operator that
maps a :class:`~pyop2.DataSet` representing its row space to a
:class:`~pyop2.DataSet` representing its column space and vice versa. These
two spaces are commonly the same, in which case the resulting matrix is
square. A sparse matrix is represented by a :class:`~pyop2.Mat`, which is
declared on a :class:`~pyop2.Sparsity` representing its non-zero structure.

.. _matrix_storage:

Sparse Matrix Storage Formats
-----------------------------

PETSc_ uses the popular Compressed Sparse Row (CSR) format to store only the
non-zero entries of a sparse matrix. In CSR, a matrix is stored as three
one-dimensional arrays of *row pointers*, *column indices* and *values*,
where the former two are of integer type and the latter of float type,
usually double. As the name suggests, non-zero entries are stored per row,
where each non-zero is defined by a pair of column index and corresponding
value. The column indices and values arrays therefore have a length equal to
the total number of non-zero entries. Row indices are given implicitly by the
row pointer array, which contains the starting index in the column indices
and values arrays for the non-zero entries of each row. In other words, the
non-zeros for row ``i`` are at positions ``row_ptr[i]`` up to but not
including ``row_ptr[i+1]`` in the column indices and values arrays. Within
each row, entries are sorted by column index to allow for faster lookups
using a binary search.

.. figure:: images/csr.svg
   :align: center

   A sparse matrix and its corresponding CSR row pointer, column indices and
   values arrays

For distributed parallel storage with MPI, the rows of the matrix are
distributed evenly among the processors. Each row is then again divided into
a *diagonal* and an *off-diagonal* part, where the diagonal part comprises
columns ``i`` to ``j`` if ``i`` and ``j`` are the first and last row owned by
a given processor, and the off-diagonal part all other columns.

.. figure:: images/mpi_matrix.svg
   :align: center

   Distribution of a sparse matrix among 3 MPI processes

.. _matrix_assembly:

Matrix assembly
---------------

Sparse matrices are assembled by adding up local contributions which are
mapped to global matrix entries via a local-to-global mapping represented by
a pair of :class:`Maps <pyop2.Map>` for the row and column space.

.. figure:: images/assembly.svg
   :align: center

   Assembly of a local tensor :math:`A^K` into a global matrix :math:`A`
   using the local-to-global mapping :math:`\iota_K^1` for rows and
   :math:`\iota_K^2` for columns

For each :func:`~pyop2.par_loop` that assembles a matrix, PyOP2 generates a
call to PETSc_'s MatSetValues_ function for each element of the iteration
set, adding the local contributions computed by the user kernel to the global
matrix using the given :class:`Maps <pyop2.Map>`. At the end of the
:func:`~pyop2.par_loop` PyOP2 automatically calls MatAssemblyBegin_ and
MatAssemblyEnd_ to finalise matrix assembly.
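In terms of the petsc4py interface, this corresponds roughly to the following
sketch. It is for illustration only and is not the code PyOP2 generates
(which calls MatSetValues_ directly from C); the matrix size, preallocation
and values are made up:

.. code-block:: python

   from petsc4py import PETSc
   import numpy as np

   NUM_NODES = 4  # made-up size for illustration

   # Create a square AIJ (CSR) matrix; in PyOP2 the preallocation would come
   # from the Sparsity, here we simply allow 3 non-zeros per row.
   A = PETSc.Mat().createAIJ(size=(NUM_NODES, NUM_NODES), nnz=3)

   # Add one element's contribution: the global row and column indices are
   # given by the row and column maps, the local tensor is computed by the
   # user kernel (here just ones).
   rows = cols = np.array([0, 1, 2], dtype=PETSc.IntType)
   local_tensor = np.ones((3, 3))
   A.setValues(rows, cols, local_tensor, addv=PETSc.InsertMode.ADD_VALUES)

   # Finalise the assembly (MatAssemblyBegin/MatAssemblyEnd).
   A.assemble()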
Consider assembling a :class:`~pyop2.Mat` on a :class:`~pyop2.Sparsity` built
from a :class:`~pyop2.Map` from ``elements`` to ``nodes``. The assembly is
done in a :func:`~pyop2.par_loop` over ``elements``, where the
:class:`~pyop2.Mat` ``A`` is accessed indirectly via the ``elem_node``
:class:`~pyop2.Map` using the :class:`~pyop2.base.IterationIndex`
:class:`~pyop2.i`:

.. code-block:: python

   nodes = op2.Set(NUM_NODES, "nodes")
   elements = op2.Set(NUM_ELE, "elements")

   elem_node = op2.Map(elements, nodes, 3, ...)

   sparsity = op2.Sparsity((nodes, nodes), (elem_node, elem_node))
   A = op2.Mat(sparsity, np.float64)

   b = op2.Dat(nodes, dtype=np.float64)

   # Assemble the matrix A
   op2.par_loop(mat_kernel, elements,
                A(op2.INC, (elem_node[op2.i[0]], elem_node[op2.i[1]])),
                ...)

   # Assemble the right-hand side vector b
   op2.par_loop(rhs_kernel, elements,
                b(op2.INC, elem_node[op2.i[0]]),
                ...)

The code generated for the :func:`~pyop2.par_loop` assembling the
:class:`~pyop2.Mat` for the sequential backend is similar to the following,
where initialisation and staging code described in :ref:`sequential_backend`
have been omitted for brevity. For each element of the iteration
:class:`~pyop2.Set` a buffer for the local tensor is initialised to zero and
passed to the user kernel performing the local assembly operation. The
``addto_vector`` call subsequently adds this local contribution to the global
sparse matrix.

.. code-block:: c

   void wrap_mat_kernel__(...) {
     ...
     for ( int n = start; n < end; n++ ) {
       int i = n;
       ...
       double buffer_arg0_0[3][3] = {{0}};     // local tensor initialised to 0
       mat_kernel(buffer_arg0_0, ...);         // local assembly kernel
       addto_vector(arg0_0_0, buffer_arg0_0,   // Mat object, local tensor
                    3, arg0_0_map0_0 + i * 3,  // # rows, global row indices
                    3, arg0_0_map1_0 + i * 3,  // # cols, global column indices
                    0);                        // mode: 0 add, 1 insert
     }
   }

.. _sparsity_pattern:

Building a sparsity pattern
---------------------------

The sparsity pattern of a matrix is uniquely defined by the dimensions of the
:class:`DataSets <pyop2.DataSet>` forming its row and column space, and one
or more pairs of :class:`Maps <pyop2.Map>` defining its non-zero structure.
PyOP2 exploits this by caching sparsity patterns, using these unique
attributes as the cache key to save expensive recomputation. Whenever a
:class:`~pyop2.Sparsity` is initialised, an already computed pattern with the
same unique key is returned if it exists.

For a valid sparsity, each row :class:`~pyop2.Map` must map to the set of the
row :class:`~pyop2.DataSet`, each column :class:`~pyop2.Map` to that of the
column :class:`~pyop2.DataSet`, and the ``from`` sets of each pair must
match. A matrix on a sparsity pattern built from more than one pair of maps
is assembled by multiple parallel loops, iterating over the corresponding
iteration set for each pair.

Sparsity construction proceeds by iterating over each :class:`~pyop2.Map`
pair and building, for each row, a set of indices of the non-zero columns.
Each pair of entries in the row and column maps gives the row and column
index of a non-zero entry in the matrix, and therefore the column index is
added to the set of non-zero entries for that particular row. The number of
non-zero entries per row is then given by the size of each row's set, and an
exclusive scan over these sizes yields the row pointer array. The column
indices array is the concatenation of all the row sets.
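These last two steps can be illustrated with a small NumPy sketch; the
per-row sets are made up and this is not PyOP2's actual implementation:

.. code-block:: python

   import numpy as np

   # Hypothetical per-row sets of non-zero column indices for a 4x4 matrix.
   row_sets = [{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}]

   nnz = np.array([len(s) for s in row_sets])       # non-zeros per row
   row_ptr = np.concatenate(([0], np.cumsum(nnz)))  # exclusive scan
   col_idx = np.concatenate([sorted(s) for s in row_sets])

   # row_ptr == [0, 2, 5, 8, 10]
   # col_idx == [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]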
An algorithm for the sequential case is given below::

    for rowmap, colmap in maps:
        for e in range(rowmap.from_size):
            for i in range(rowmap.arity):
                row = rowmap.values[i + e*rowmap.arity]
                for d in range(colmap.arity):
                    diag[row].insert(colmap.values[d + e*colmap.arity])

For the MPI parallel case a minor modification is required, since for each
row a set of diagonal and off-diagonal column indices needs to be built as
described in :ref:`matrix_storage`::

    for rowmap, colmap in maps:
        for e in range(rowmap.from_size):
            for i in range(rowmap.arity):
                row = rowmap.values[i + e*rowmap.arity]
                if row < nrows:
                    for d in range(colmap.arity):
                        col = colmap.values[d + e*colmap.arity]
                        if col < ncols:
                            diag[row].insert(col)
                        else:
                            odiag[row].insert(col)

.. _solving:

Solving a linear system
-----------------------

PyOP2 provides a :class:`~pyop2.Solver`, wrapping the PETSc_ KSP_ Krylov
solvers which support various iterative methods such as Conjugate Gradients
(CG), Generalized Minimal Residual (GMRES), a stabilized version of
BiConjugate Gradient Squared (BiCGStab) and others. The solvers are
complemented with a range of preconditioners from PETSc_'s PC_ collection,
which includes Jacobi, incomplete Cholesky and LU decompositions and various
multigrid-based preconditioners.

The choice of solver and preconditioner type and other parameters uses
PETSc_'s configuration mechanism documented in the `PETSc manual`_. Options
are passed to the :class:`~pyop2.Solver` via the keyword argument
``parameters``, taking a dictionary of arguments, or directly as individual
keyword arguments. The solver type is chosen as ``ksp_type``, the
preconditioner as ``pc_type``, with the defaults ``cg`` and ``jacobi``.

Solving a linear system of the matrix ``A`` assembled above and the
right-hand side vector ``b`` for a solution vector ``x`` is done with a call
to :meth:`~pyop2.Solver.solve`, where solver and preconditioner are chosen as
``gmres`` and ``ilu``::

    x = op2.Dat(nodes, dtype=np.float64)

    solver = op2.Solver(ksp_type='gmres', pc_type='ilu')
    solver.solve(A, x, b)

.. _gpu_assembly:

GPU matrix assembly
-------------------

In a :func:`~pyop2.par_loop` assembling a :class:`~pyop2.Mat` on the GPU, the
local contributions are first computed for all elements of the iteration set
and stored in global memory in a structure-of-arrays (SoA) data layout such
that all threads can write the data out in a coalesced manner. For the
example above, the generated CUDA wrapper code is as follows, again omitting
initialisation and staging code described in :ref:`cuda_backend`. The user
kernel only computes a single element in the local iteration space as
detailed in :ref:`local-iteration-spaces`.

.. code-block:: c

   __global__ void __mat_kernel_stub(...,
                                     double *arg0,     // local matrix data array
                                     int arg0_offset,  // offset into the array
                                     ... ) {
     ... // omitted initialisation and shared memory staging code
     for ( int idx = threadIdx.x; idx < nelem; idx += blockDim.x ) {
       ... // omitted staging code
       for ( int i0 = 0; i0 < 3; ++i0 ) {
         for ( int i1 = 0; i1 < 3; ++i1 ) {
           mass_cell_integral_0_otherwise(
               (double (*)[1])(arg0 + arg0_offset + idx * 9 + i0 * 3 + i1 * 1),
               ..., i0, i1);
         }
       }
     }
   }

A separate CUDA kernel, given below, is launched afterwards to compress the
data into a sparse matrix in CSR storage format. Only the values array needs
to be computed, since the row pointer and column indices have already been
computed when building the sparsity on the host and subsequently transferred
to GPU memory. Memory for the local contributions and the values array only
needs to be allocated on the GPU.
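The compression kernel relies on a helper ``pos`` to locate the position of a
global (row, column) pair in the values array by searching the sorted column
indices of that row, as described in :ref:`matrix_storage`. A minimal Python
sketch of such a lookup, for illustration only and not PyOP2's actual device
code, is:

.. code-block:: python

   import numpy as np

   def pos(row, col, row_ptr, col_idx):
       """Return the index of entry (row, col) in the CSR values array."""
       start, end = row_ptr[row], row_ptr[row + 1]
       # Column indices within a row are sorted, so a binary search suffices.
       offset = np.searchsorted(col_idx[start:end], col)
       assert offset < end - start and col_idx[start + offset] == col, \
           "entry not in the sparsity"
       return start + offset

   row_ptr = np.array([0, 2, 5, 8, 10])
   col_idx = np.array([0, 1, 0, 1, 2, 1, 2, 3, 2, 3])
   assert pos(2, 3, row_ptr, col_idx) == 7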
.. code-block:: c

   __global__ void __lma_to_csr(double *lmadata,  // local matrix data array
                                double *csrdata,  // CSR values array
                                int *rowptr,      // CSR row pointer array
                                int *colidx,      // CSR column indices array
                                int *rowmap,      // row map array
                                int rowmapdim,    // row map arity
                                int *colmap,      // column map array
                                int colmapdim,    // column map arity
                                int nelems) {
     int nentries_per_ele = rowmapdim * colmapdim;
     int n = threadIdx.x + blockIdx.x * blockDim.x;
     if ( n >= nelems * nentries_per_ele ) return;

     int e = n / nentries_per_ele;                        // set element
     int i = (n - e * nentries_per_ele) / rowmapdim;      // local row
     int j = (n - e * nentries_per_ele - i * colmapdim);  // local column

     // Compute position in values array
     int offset = pos(rowmap[e * rowmapdim + i], colmap[e * colmapdim + j],
                      rowptr, colidx);
     __atomic_add(csrdata + offset, lmadata[n]);
   }

.. _gpu_solve:

GPU linear algebra
------------------

Linear algebra on the GPU with the ``cuda`` backend uses the Cusp_ library,
which does not support all solvers and preconditioners provided by PETSc_.
The interface to the user is the same as for the ``sequential`` and
``openmp`` backends. Supported solver types are CG (``cg``), GMRES
(``gmres``) and BiCGStab (``bicgstab``), with preconditioners of types Jacobi
(``jacobi``), Bridson approximate inverse (``ainv``) and algebraic multigrid
(``amg``). An exception is raised if an unsupported solver or preconditioner
type is requested. A Cusp_ solver with the chosen parameters is automatically
generated when :meth:`~pyop2.Solver.solve` is called.

.. note::
   Distributed parallel linear algebra operations with MPI are currently not
   supported by the ``cuda`` backend.

.. _PETSc: http://www.mcs.anl.gov/petsc/
.. _petsc4py: http://pythonhosted.org/petsc4py/
.. _MatSetValues: http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Mat/MatSetValues.html
.. _MatAssemblyBegin: http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Mat/MatAssemblyBegin.html
.. _MatAssemblyEnd: http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Mat/MatAssemblyEnd.html
.. _KSP: http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/KSP/
.. _PC: http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/PC/
.. _PETSc manual: http://www.mcs.anl.gov/petsc/petsc-dev/docs/manual.pdf
.. _Cusp: http://cusplibrary.github.io
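For example, assuming PyOP2 was initialised with the ``cuda`` backend (e.g.
via ``op2.init(backend='cuda')``; the exact initialisation call is an
assumption here), the system assembled earlier is solved through the same
interface, restricted to the supported types:

.. code-block:: python

   # CG with Jacobi preconditioning is in the supported list above; asking
   # for an unsupported type such as pc_type='ilu' raises an exception.
   solver = op2.Solver(ksp_type='cg', pc_type='jacobi')
   solver.solve(A, x, b)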