Sparse vectors and matrices are those in which most of the elements are zero. They often arise when a continuous mathematical model of a problem is turned into a discrete algorithm that can be programmed to run on a computer. Significant memory and time savings can be made by avoiding storing and computing the zero elements of these matrices. Therefore, optimized library support for sparse linear algebra operations is an important part of a numerical software developer's toolkit. Arm Performance Libraries (Arm PL) includes key sparse linear algebra functions (such as sparse matrix-vector and matrix-matrix multiplication) optimized for Arm (AArch64) systems, providing support for C/C++ and Fortran developers on Arm.

In Arm PL 23.04 we have introduced a sparse triangular matrix solve function and extended the library to include, for the first time, functions which operate on sparse vectors. The new functionality is available in single-precision (fp32) and double-precision (fp64) for real and complex types, with interfaces for both C/C++ and Fortran. The release also includes a selection of examples.

Summarizing the new sparse functionality added in Arm PL 23.04:

- armpl_spsv_exec: Sparse triangular matrix solve.
- armpl_spvec_create & armpl_spvec_destroy: The new sparse vector functions work with a new type, `armpl_spvec_t`, which complements the existing sparse matrix type `armpl_spmat_t`. Use armpl_spvec_create to initialize an `armpl_spvec_t` variable, and use armpl_spvec_destroy to free any memory allocated by the library for the type.
- armpl_spwaxpby_exec: Scale a sparse vector and accumulate into a dense vector.
- armpl_spdot_exec: Dot product of a sparse vector and a dense vector.
- armpl_sprot_exec: Plane rotation of points defined using a sparse vector and a dense vector.
- armpl_spvec_gather: Copy non-zero elements from a dense vector into a sparse vector.
- armpl_spvec_scatter: Populate a dense vector with elements from a sparse vector.
- armpl_spvec_update: Update the values of non-zero elements in a sparse vector.
- armpl_spvec_export: Export the length, number of non-zeros and the elements of a sparse vector.
- armpl_spmat_print_err & armpl_spvec_print_err: Print the details of any errors associated with a sparse object.

### Sparse Triangular Solve

Finding the solution to a linear system *Ax = αy* (i.e. finding the vector *x*, given the matrix *A* and the vector *y* multiplied by the scalar *α*) is a problem common in numerical computing across almost all domains of science and engineering. Calls to `armpl_spsv_exec` solve this problem when *A* is a sparse upper or lower triangular matrix and *x* and *y* are dense vectors. It is also possible to solve using the transpose of the matrix, *Aᵀx = αy*, or, for complex-valued matrices, the conjugate-transpose, *A\*x = αy*.

The new function follows the same *inspector-executor* workflow as used by the other sparse matrix operators in Arm PL. The following simple C example program demonstrates the workflow: it creates a sparse matrix object, provides hints about the problem we are about to solve, calls an optimization function, executes the solve, and finally destroys the sparse matrix object:

```c
/*
 * Double precision sparse matrix triangular solve example
 *
 * ARMPL version 23.04 Copyright Arm 2023
 */
#include <stdio.h>
#include <stdlib.h>
#include "armpl.h"

#define NNZ 18
#define N 8

int main() {
	/* 1. Set up local CSR structure */
	armpl_spmat_t armpl_mat;
	const double alpha = 2.0;
	armpl_int_t creation_flags = 0;
	double vals[NNZ] = {1., 1., 1., 1., 1., 2., 3., 1., 3., 4., 1., 4., 5., 1., 11., 14., 16., 18.};
	armpl_int_t row_ptr[N+1] = {0, 4, 7, 10, 13, 15, 16, 17, 18};
	armpl_int_t col_indx[NNZ] = {0, 2, 5, 7, 1, 3, 6, 2, 4, 5, 3, 4, 6, 4, 7, 5, 6, 7};

	/* 2. Set up Arm Performance Libraries sparse matrix object */
	armpl_status_t info = armpl_spmat_create_csr_d(&armpl_mat, N, N, row_ptr, col_indx, vals, creation_flags);
	if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

	/* 3. Supply any hints about the SpTRSV calculations to be performed */
	info = armpl_spmat_hint(armpl_mat, ARMPL_SPARSE_HINT_SPSV_OPERATION, ARMPL_SPARSE_OPERATION_NOTRANS);
	if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

	/* 4. Call an optimization process that will learn from the hints you have previously supplied */
	info = armpl_spsv_optimize(armpl_mat);
	if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

	/* 5. Set up input and output vectors, then do the SpSV and print the result */
	double *x = malloc(N*sizeof(double));
	double *y = malloc(N*sizeof(double));
	for (int i=0; i<N; i++) {
		y[i] = 2.0 + (double)(i);
	}

	info = armpl_spsv_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, armpl_mat, x, alpha, y);
	if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

	printf("Computed solution vector x:\n");
	for (int i=0; i<N; i++) {
		printf(" %4.1f\n", x[i]);
	}

	/* 6. Destroy created matrix to free any memory */
	info = armpl_spmat_destroy(armpl_mat);
	if (info!=ARMPL_STATUS_SUCCESS) printf("ERROR: armpl_spmat_destroy returned %d\n", info);

	/* 7. Free user allocated storage */
	free(x);
	free(y);
	return (int)info;
}
```

During the optimization step, if the `ARMPL_SPARSE_HINT_SPSV_INVOCATIONS` hint type is set to `ARMPL_SPARSE_INVOCATIONS_MANY`, then the library spends some time analyzing the sparse matrix to improve the performance of the execution call, on the basis that the cost of optimization will be amortized by the time saved when a solve with the same matrix is executed repeatedly within a loop, as is common in sparse matrix applications. In Arm PL 23.04 the only optimization currently enabled is the creation of a parallel decomposition of the matrix, so the optimization call returns immediately if the number of threads is 1 (i.e. if `OMP_NUM_THREADS=1`).

We have compared the performance of Arm PL's triangular matrix solve functionality with that of other vendor libraries, using the following setups:

- Arm PL 23.04 on AWS Graviton3, AWS EC2 c7g.metal
- Intel's MKL in OneAPI 2023.0.0 on 3rd Gen. Intel Xeon, AWS EC2 c6i.metal
- AMD's AOCL-sparse release 4.0.0 on 3rd Gen. AMD EPYC, AWS EC2 c6a.metal
- NVIDIA's cuSPARSE in NVIDIA HPC SDK V21.7 on an A100 GPU

The performance results from solving six matrices from the SuiteSparse Matrix Collection are given below, using 1, 8 and 16 threads for Arm PL and MKL. AOCL does not appear to have a parallel triangular solve implementation, so only the 1-thread result is shown. cuSPARSE results are given alongside the 16-thread cases (on the basis that, since it uses a GPU, it must be a parallel implementation). The bar chart gives the time in seconds taken for the optimization call and 100 execution calls; note the logarithmic scale. The bars for each library are paired next to each other on the chart for a given thread count, with the optimization call on the left in a slightly lighter shade than the execution call on the right. When using 1 thread, AOCL and Arm PL return immediately from the optimization call (for the reasons given above for Arm PL), so those bars do not appear on the chart. All benchmark results were verified by checking that they produce a suitably small residual, |αy − Ax|.

The details of the matrices are as follows:

| Matrix | Dimension (N) | Number of Non-zeros (NNZ) |
|---|---|---|
| Fault_639 | 638 802 | 14 626 683 |
| Hook_1498 | 1 498 023 | 31 207 734 |
| PFlow_742 | 742 793 | 18 940 627 |
| Queen_4147 | 4 147 110 | 166 823 197 |
| dielFilterV2real | 1 157 456 | 24 848 204 |
| dielFilterV3real | 1 102 824 | 45 204 422 |

The results show that Arm PL running on Graviton3 is always competitive with other vendor libraries running on native hardware, and in some cases the best performance is on Arm.

Performing a sparse triangular matrix solve tends to be a memory-bound operation for matrices such as these, whose memory requirements exceed the capacity of the fast levels of data cache. This is especially true when using only a single core, and the single-threaded cases perform especially well on Graviton3. The degree of parallelism available depends on the structure of the sparse matrix, so the benefit of using multiple threads varies from one matrix to another, but the performance of Arm PL with 8 and 16 threads on Graviton3 matches the performance of MKL on Xeon very closely, for both the optimization and execution calls. These results also suggest that sparse triangular matrix solves are perhaps better suited to multicore CPUs than to a GPU: when using 16 threads, both Arm PL and MKL outperform cuSPARSE.

### Sparse Vector Support

Three types of computational operations involving sparse vectors are supported in Arm PL 23.04: plane rotations, dot product and "axpby"; in each case, the operation is on a vector in sparse format and a vector in dense format. We provide some specializations of each of these operations:

- Plane rotation functions, where *x* is a sparse vector and *y* is a dense vector.
- Dot product functions, where *x* is a sparse vector and *y* is a dense vector.
- axpby functions, where *x* is a sparse vector, *w* and *y* are dense vectors, and *α* and *β* are scalars: *w = αx + βy*.

In each case, operations are only carried out on the elements which correspond to non-zero values in the sparse vector. For example, the plane rotation is defined as:

```c
for (i=0; i<nnz; i++) {
	temp        = c*x[i] + s*y[index[i]];
	y[index[i]] = c*y[index[i]] - s*x[i];
	x[i]        = temp;
}
```

where x is the sparse vector, y is the dense vector, and c and s are the cosine and sine components passed to the function, which define the rotation matrix [c, s; -s, c].

See the Arm PL 23.04 release for C and Fortran examples of using armpl_sprot_exec. Note that vector operations do not give the same scope for optimization as matrix operations (since the complexity is just O(nnz)), so the workflow is slightly simpler: create a sparse vector variable (`armpl_spvec_t`) using armpl_spvec_create, and then perform the operation using the appropriate execution function (e.g. armpl_sprot_exec). To access the details of the resulting sparse vector, export it using armpl_spvec_export, and finally destroy the sparse vector variable once it is no longer needed, using armpl_spvec_destroy.

Arm PL offers some flexibility in this workflow through extra utility functions. For example, if you already have a dense-format representation of a vector (i.e. one explicitly storing zeros), an alternative way to create an `armpl_spvec_t` is to "gather" the non-zero elements into a sparse vector using armpl_spvec_gather (see the code snippet below). To complement this we also provide armpl_spvec_scatter, which populates a dense vector with the elements of a sparse vector; scatter can be used as an alternative to armpl_spvec_export in the workflow described above. Sometimes it is necessary to update the values in a sparse vector without changing its sparsity structure; this can be done in Arm PL using armpl_spvec_update, without having to create a new variable. The sparse vector export and update utilities match those already provided for sparse matrices in Arm PL.

```c
// y is a dense format vector (double *y) containing many zeros, which we
// want to turn into a sparse format vector x before doing a plane rotation
// involving another dense format vector z
...

// Gather y into new sparse vector variable x
armpl_spvec_t x;
armpl_status_t info = armpl_spvec_gather(y, 0, n, &x, 0);
if (info!=ARMPL_STATUS_SUCCESS) { armpl_spvec_print_err(x); return (int)info; }

// Perform the plane rotation
info = armpl_sprot_exec(x, &z[0], c, s);
if (info!=ARMPL_STATUS_SUCCESS) { armpl_spvec_print_err(x); return (int)info; }

...
```

### Summary

With the new functions introduced in this blog, Arm PL has taken another step towards providing numerical software developers with the same optimized sparse linear algebra functionality that can be found in libraries optimized for other architectures. And we have shown that performance is very competitive on Arm. We look forward to hearing feedback on these new functions, and also hearing about more sparse functionality which developers want to see performing well on Arm in the future.