New sparse functions in Arm Performance Libraries 23.04

May 9, 2023

9 minute read time.

Sparse vectors and matrices are ones where most of the elements are zero. They quite often arise as a result of turning a continuous mathematical model of a problem into a discrete algorithm that can be programmed to run on a computer. Significant memory and time savings can be made by avoiding storing and computing the zero elements of these matrices. Therefore, optimized library support for sparse linear algebra operations is an important part of a numerical software developer's toolkit. Arm Performance Libraries (Arm PL) includes key sparse linear algebra functions (such as sparse matrix-vector and matrix-matrix multiplication) optimized for Arm (AArch64) systems, providing support for C/C++ and Fortran developers on Arm.

In Arm PL 23.04 we have introduced a sparse triangular matrix solve function and extended the library to include, for the first time, functions which operate on sparse vectors. The new functionality is available in single-precision (fp32) and double-precision (fp64) for real and complex types, with interfaces for both C/C++ and Fortran. The release also includes a selection of examples.

Summarizing the new sparse functionality added in Arm PL 23.04:

armpl_spsv_exec: Sparse triangular matrix solve.
armpl_spvec_create & armpl_spvec_destroy: The new sparse vector functions work with a new type, armpl_spvec_t, which complements the existing sparse matrix type armpl_spmat_t. Use armpl_spvec_create to initialize an armpl_spvec_t variable, and use armpl_spvec_destroy to free any memory allocated by the library for the type.
armpl_spwaxpby_exec: Scale a sparse vector and accumulate into a dense vector.
armpl_spdot_exec : Dot product of a sparse vector and a dense vector.
armpl_sprot_exec: Plane rotation of points defined using a sparse vector and a dense vector.
armpl_spvec_gather: Copy non-zero elements from a dense vector into a sparse vector.
armpl_spvec_scatter: Populate a dense vector with elements from a sparse vector.
armpl_spvec_update: Update the values of non-zero elements in a sparse vector.
armpl_spvec_export: Export the length, number of non-zeros and the elements of a sparse vector.
armpl_spmat_printerr & armpl_spvec_printerr: Print the details of any errors associated with a sparse object.

Sparse Triangular Solve

Finding the solution to a linear system Ax = αy (i.e. finding the vector x, given the matrix A and the vector y multiplied by the scalar α) is a problem common in numerical computing across almost all domains of science and engineering. Calls to armpl_spsv_exec solve this problem when A is a sparse upper or lower triangular matrix and x and y are dense vectors. It is also possible to solve using the transpose of the matrix: A^Tx = αy or, for complex valued matrices, the conjugate-transpose: A^*x = αy.

The new function follows the same inspector-executor workflow as used by the other sparse matrix operators in Arm PL. The following simple C example program demonstrates the workflow: it creates a sparse matrix object, provides hints about the problem we are about to solve, calls an optimization function, executes the solve, and finally destroys the sparse matrix object:

/*
 * Double precision sparse matrix triangular solve example
 *
 * ARMPL version 23.04 Copyright Arm 2023
 */

#include <stdio.h>
#include <stdlib.h>
#include "armpl.h"

#define NNZ 18
#define N 8

int main()
{
    /* 1. Set-up local CSR structure */
    armpl_spmat_t armpl_mat;
    const double alpha = 2.0;
    armpl_int_t creation_flags = 0;
    double vals[NNZ] = {1., 1., 1., 1., 1., 2., 3., 1., 3., 4., 1., 4., 5., 1., 11., 14., 16., 18.};
    armpl_int_t row_ptr[N+1] = {0, 4, 7, 10, 13, 15, 16, 17, 18};
    armpl_int_t col_indx[NNZ] = {0, 2, 5, 7, 1, 3, 6, 2, 4, 5, 3, 4, 6, 4, 7, 5, 6, 7};

    /* 2. Set-up Arm Performance Libraries sparse matrix object */
    armpl_status_t info = armpl_spmat_create_csr_d(&armpl_mat, N, N, row_ptr, col_indx, vals, 
                                                   creation_flags);
    if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

    /* 3. Supply any hints that are about the SpTRSV calculations to be performed */
    info = armpl_spmat_hint(armpl_mat, ARMPL_SPARSE_HINT_SPSV_OPERATION, 
                            ARMPL_SPARSE_OPERATION_NOTRANS);
    if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

    /* 4. Call an optimization process that will learn from the hints you have previously supplied */
    info = armpl_spsv_optimize(armpl_mat);
    if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

    /* 5. Setup input and output vectors and then do Spsv and print result.*/
    double *x = malloc(N*sizeof(double));
    double *y = malloc(N*sizeof(double));
    for (int i=0; i<N; i++) {
        y[i] = 2.0 + (double)(i);
    }

    info = armpl_spsv_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, armpl_mat, x, alpha, y);
    if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

    printf("Computed solution vector x:\n");
    for (int i=0; i<N; i++) {
        printf("  %4.1f\n", x[i]);
    }

    /* 6. Destroy created matrix to free any memory */
    info = armpl_spmat_destroy(armpl_mat);
    if (info!=ARMPL_STATUS_SUCCESS) printf("ERROR: armpl_spmat_destroy returned %d\n", info);

    /* 7. Free user allocated storage */
    free(x); free(y);

    return (int)info;
}

During the optimization step, if the ARMPL_SPARSE_HINT_SPSV_INVOCATIONS hint type is set to ARMPL_SPARSE_INVOCATIONS_MANY then the library spends some time analyzing the sparse matrix to improve performance of the execution call, on the basis that the cost of optimization will be amortized by the time saved when repeatedly executing a solve using the same matrix within a loop, which is common in sparse matrix applications. In Arm PL 23.04, the only optimization currently enabled is to create a parallel decomposition of the matrix. So the optimization call returns immediately if the number of threads is 1 (i.e. if OMP_NUM_THREADS=1).

We have compared the performance of Arm PL's triangular matrix solve functionality to those provided by other vendor libraries using the following setups:

Arm PL 23.04 on AWS Graviton3, AWS EC2 c7g.metal
Intel's MKL in OneAPI 2023.0.0 on 3rd Gen. Intel Xeon, AWS EC2 c6i.metal
AMD's AOCL-sparse release 4.0.0 on 3rd Gen. AMD EPYC, AWS EC2 c6a.metal
NVIDIA's cuSPARSE in NVIDIA HPC SDK V21.7 on an A100 GPU

The performance results from solving 6 matrices from the SuiteSparse Matrix Collection are given below when using 1, 8 and 16 threads for Arm PL and MKL. AOCL does not appear to have a parallel triangular solve implementation, so only the result with 1 thread is shown. cuSPARSE results are given alongside the 16 threads cases (on the basis that since it is using a GPU it must be a parallel implementation). The bar chart gives the time in seconds taken for the optimization call and 100 execution calls; note the logarithmic scale. The bars for each library are paired next to each other on the chart for a given thread count. The optimization call is on the left, in a slightly lighter shade than the execution call on the right. When using 1 thread AOCL and Arm PL return immediately from the optimization call (for reasons mentioned above for Arm PL), therefore those bars do not appear on the chart. All benchmark results were verified by checking that they produce a suitably small residual, |αy - A*x|.

Bar chart showing the performance of sparse triangular solve for a selection of matrices and vendor libraries

The details of the matrices are as follows:

Matrix	Dimension (N)	Number of Non-zeros (NNZ)
Fault_639	638 802	14 626 683
Hook_1498	1 498 023	31 207 734
PFlow_742	742 793	18 940 627
Queen_4147	4 147 110	166 823 197
dielFilterV2real	1 157 456	24 848 204
dielFilterV3real	1 102 824	45 204 422

The results show that Arm PL running on Graviton3 is always competitive with other vendor libraries running on native hardware, and in some cases the best performance is on Arm.

Performing a sparse triangular matrix solve tends to be a memory-bound operation when using matrices such as these, whose memory requirements exceed the capacity of fast levels of data cache. This condition is especially true when only using a single core, and the single-threaded cases perform especially well on Graviton3. The degree of parallelism available depends on the structure of the sparse matrix. So the benefits of using multiple threads vary from one matrix to another. But the performance of Arm PL with 8 and 16 threads on Graviton3 matches very closely the performance of MKL on Xeon, for both the optimization and execution calls. These results also show that performing a sparse triangular matrix solve is perhaps better suited to multicore CPUs than a GPU: when using 16 threads both Arm PL and MKL outperform cuSPARSE.

Sparse Vector Support

Three types of computation operations involving sparse vectors are supported in Arm PL 23.04: plane rotations, dot product and "axpy"; in each case, the operation is on a vector in sparse format and a vector in dense format. We provide some specializations of each of these operations:

Plane rotations functions:
- fp32 and fp64 real vectors;
- fp32 and fp64 complex vectors where the sine component is a complex variable (matching e.g. LAPACK ZROT);
- fp32 and fp64 complex vectors where the sine component is a real variable (matching e.g. BLAS ZDROT).
Dot product functions, where x is a sparse vector and y is a dense vector:
- result = x^T·y for fp32 and fp64 real vectors;
- result = x^T·y for fp32 and fp64 complex vectors;
- result = x*·y for fp32 and fp64 complex vectors where the sparse vector is conjugated.
axpy functions, where x is a sparse vector, w and y are dense vectors, and α and β are scalars:
- y = αx + βy for fp32 and fp64 real vectors, and fp32 and fp64 complex vectors;
- w = αx + βy (note separate result vector w) for fp32 and fp64 real vectors and fp32 and fp64 complex vectors.

In each case, operations are only carried out on the elements which correspond to non-zero values in the sparse vector. For example, plane rotation is defined as:

for (i=0; i<nnz; i++) {
    temp = c*x[i] + s*y[index[i]];
    y[idx] = c*y[index[i]] - s*x[i];
    x[i] = temp;
}

Where x is the sparse matrix, y is the dense matrix, and c and s are the cosine and sine components passed to the function, which define the rotation matrix:

See the Arm PL 23.04 release for C and Fortran examples of using armpl_sprot_exec. Note vector operations do not give the same scope for optimization as for matrices (since the complexity is just O(nnz)), so the workflow is slightly simpler: create a sparse vector variable (armpl_spvec_t) using armpl_spvec_create, and then perform the operation using the appropriate execution function (e.g. armpl_sprot_exec). To access the details of the resulting sparse vector export it using armpl_spvec_export, and finally destroy the sparse vector variable once it is no longer needed, using armpl_spvec_destroy.

Arm PL offers some flexibility in this workflow with the introduction of extra utility functions. For example, an alternative way to create an armpl_spvec_t if you already have a dense format representation of a vector (i.e. explicitly storing zeros) is to "gather" the non-zero elements into a sparse vector using armpl_spvec_gather (e.g. see the code snippet below). To complement this we also provide armpl_spvec_scatter, which populates a dense vector with the elements of a sparse vector. Scatter could be used as an alternative to armpl_spvec_export as described in the previous workflow. Sometimes it is necessary to update the values in a sparse vector, without changing its sparse structure. It is possible to do this update in Arm PL using armpl_spvec_update without having to create a new variable. The sparse vector export and update utilities match those already provided for sparse matrices in Arm PL.

        // y is a dense format vector (double *y) containing many zeros, which we
        // want to turn into a sparse format vector x before doing a plane rotation
        // involving another dense format vector z
    
        ...
        
        // Gather y into new sparse vector variable x
        armpl_spvec_t x;
        armpl_status_t info = armpl_spvec_gather(y, 0, n, &x, 0);
        if (info!=ARMPL_STATUS_SUCCESS) { armpl_spvec_print_err(x); return (int)info; }

        // Perform the plane rotation
        info = armpl_sprot_exec(x, &z[0], c, s);
        if (info!=ARMPL_STATUS_SUCCESS) { armpl_spvec_print_err(x); return (int)info; }

        ...

Summary

With the new functions introduced in this blog, Arm PL has taken another step towards providing numerical software developers with the same optimized sparse linear algebra functionality that can be found in libraries optimized for other architectures. And we have shown that performance is very competitive on Arm. We look forward to hearing feedback on these new functions, and also hearing about more sparse functionality which developers want to see performing well on Arm in the future.

[CTAToken URL = "https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog" target="_blank" text="Explore HPC Blogs" class ="green"]

High Performance Computing (HPC) blog

Expanding Arm on Arm with the NVIDIA Grace CPU

Tim Thornton

In this blog, we show how the Arm Neoverse V2-based NVIDIA Grace CPU can run Arm's most performance-critical workloads and allows Arm to operate a consistent environment in-cloud and on-prem.
- November 20, 2024
Arm Performance Libraries 24.10

Chris Goodyer

In this blog post, we review the improvements made to Arm Performance Libraries 24.10.
- November 11, 2024
Optimizing the Pardiso Sparse Linear Solver on Arm Architecture by Panua Technologies: A Performance Comparison with Intel MKL

David Lecomber

This blog outlines the strategies utilized to enhance Pardiso's performance by leveraging the Arm architecture and presents a comparative study with Intel MKL Pardiso.
- October 22, 2024

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

New sparse functions in Arm Performance Libraries 23.04

Sparse Triangular Solve

Sparse Vector Support

Summary

Expanding Arm on Arm with the NVIDIA Grace CPU

Arm Performance Libraries 24.10

Optimizing the Pardiso Sparse Linear Solver on Arm Architecture by Panua Technologies: A Performance Comparison with Intel MKL