High Performance Computing (HPC) blog
New sparse functions in Arm Performance Libraries 23.04

Chris Armstrong
May 9, 2023
9 minute read time.

Sparse vectors and matrices are ones where most of the elements are zero. They quite often arise as a result of turning a continuous mathematical model of a problem into a discrete algorithm that can be programmed to run on a computer. Significant memory and time savings can be made by avoiding storing and computing the zero elements of these matrices. Therefore, optimized library support for sparse linear algebra operations is an important part of a numerical software developer's toolkit. Arm Performance Libraries (Arm PL) includes key sparse linear algebra functions (such as sparse matrix-vector and matrix-matrix multiplication) optimized for Arm (AArch64) systems, providing support for C/C++ and Fortran developers on Arm.

In Arm PL 23.04 we have introduced a sparse triangular matrix solve function and extended the library to include, for the first time, functions which operate on sparse vectors. The new functionality is available in single-precision (fp32) and double-precision (fp64) for real and complex types, with interfaces for both C/C++ and Fortran. The release also includes a selection of examples.

Summarizing the new sparse functionality added in Arm PL 23.04:

  • armpl_spsv_exec: Sparse triangular matrix solve.
  • armpl_spvec_create & armpl_spvec_destroy: The new sparse vector functions work with a new type, armpl_spvec_t, which complements the existing sparse matrix type armpl_spmat_t. Use armpl_spvec_create to initialize an armpl_spvec_t variable, and use armpl_spvec_destroy to free any memory allocated by the library for the type.
  • armpl_spwaxpby_exec: Scale a sparse vector and accumulate into a dense vector.
  • armpl_spdot_exec: Dot product of a sparse vector and a dense vector.
  • armpl_sprot_exec: Plane rotation of points defined using a sparse vector and a dense vector.
  • armpl_spvec_gather: Copy non-zero elements from a dense vector into a sparse vector.
  • armpl_spvec_scatter: Populate a dense vector with elements from a sparse vector.
  • armpl_spvec_update: Update the values of non-zero elements in a sparse vector.
  • armpl_spvec_export: Export the length, number of non-zeros and the elements of a sparse vector.
  • armpl_spmat_print_err & armpl_spvec_print_err: Print the details of any errors associated with a sparse object.

Sparse Triangular Solve

Finding the solution to a linear system Ax = αy (i.e. finding the vector x, given the matrix A and the vector y multiplied by the scalar α) is a problem common in numerical computing across almost all domains of science and engineering. Calls to armpl_spsv_exec solve this problem when A is a sparse upper or lower triangular matrix and x and y are dense vectors. It is also possible to solve using the transpose of the matrix, Aᵀx = αy, or, for complex-valued matrices, the conjugate transpose, Aᴴx = αy.

The new function follows the same inspector-executor workflow as used by the other sparse matrix operators in Arm PL. The following simple C example program demonstrates the workflow: it creates a sparse matrix object, provides hints about the problem we are about to solve, calls an optimization function, executes the solve, and finally destroys the sparse matrix object:

/*
 * Double precision sparse matrix triangular solve example
 *
 * ARMPL version 23.04 Copyright Arm 2023
 */

#include <stdio.h>
#include <stdlib.h>
#include "armpl.h"

#define NNZ 18
#define N 8

int main()
{
    /* 1. Set-up local CSR structure */
    armpl_spmat_t armpl_mat;
    const double alpha = 2.0;
    armpl_int_t creation_flags = 0;
    double vals[NNZ] = {1., 1., 1., 1., 1., 2., 3., 1., 3., 4., 1., 4., 5., 1., 11., 14., 16., 18.};
    armpl_int_t row_ptr[N+1] = {0, 4, 7, 10, 13, 15, 16, 17, 18};
    armpl_int_t col_indx[NNZ] = {0, 2, 5, 7, 1, 3, 6, 2, 4, 5, 3, 4, 6, 4, 7, 5, 6, 7};

    /* 2. Set-up Arm Performance Libraries sparse matrix object */
    armpl_status_t info = armpl_spmat_create_csr_d(&armpl_mat, N, N, row_ptr, col_indx, vals, 
                                                   creation_flags);
    if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

    /* 3. Supply any hints about the SpTRSV calculations to be performed */
    info = armpl_spmat_hint(armpl_mat, ARMPL_SPARSE_HINT_SPSV_OPERATION, 
                            ARMPL_SPARSE_OPERATION_NOTRANS);
    if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

    /* 4. Call an optimization process that will learn from the hints you have previously supplied */
    info = armpl_spsv_optimize(armpl_mat);
    if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

    /* 5. Set up the input and output vectors, perform the SpSV, and print the result */
    double *x = malloc(N*sizeof(double));
    double *y = malloc(N*sizeof(double));
    for (int i=0; i<N; i++) {
        y[i] = 2.0 + (double)(i);
    }

    info = armpl_spsv_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, armpl_mat, x, alpha, y);
    if (info!=ARMPL_STATUS_SUCCESS) { armpl_spmat_print_err(armpl_mat); return (int)info; }

    printf("Computed solution vector x:\n");
    for (int i=0; i<N; i++) {
        printf("  %4.1f\n", x[i]);
    }

    /* 6. Destroy created matrix to free any memory */
    info = armpl_spmat_destroy(armpl_mat);
    if (info!=ARMPL_STATUS_SUCCESS) printf("ERROR: armpl_spmat_destroy returned %d\n", info);

    /* 7. Free user allocated storage */
    free(x); free(y);

    return (int)info;
}

During the optimization step, if the ARMPL_SPARSE_HINT_SPSV_INVOCATIONS hint type is set to ARMPL_SPARSE_INVOCATIONS_MANY, the library spends some time analyzing the sparse matrix to improve the performance of the execution call. The cost of this analysis is amortized by the time saved when a solve is executed repeatedly with the same matrix inside a loop, which is common in sparse matrix applications. In Arm PL 23.04 the only optimization currently enabled is to create a parallel decomposition of the matrix, so the optimization call returns immediately if the number of threads is 1 (i.e. if OMP_NUM_THREADS=1).

We have compared the performance of Arm PL's triangular matrix solve functionality to those provided by other vendor libraries using the following setups:

  • Arm PL 23.04 on AWS Graviton3, AWS EC2 c7g.metal
  • Intel's MKL in OneAPI 2023.0.0 on 3rd Gen. Intel Xeon, AWS EC2 c6i.metal
  • AMD's AOCL-sparse release 4.0.0 on 3rd Gen. AMD EPYC, AWS EC2 c6a.metal
  • NVIDIA's cuSPARSE in NVIDIA HPC SDK V21.7 on an A100 GPU

The performance results from solving 6 matrices from the SuiteSparse Matrix Collection are given below, using 1, 8 and 16 threads for Arm PL and MKL. AOCL does not appear to have a parallel triangular solve implementation, so only the 1-thread result is shown. cuSPARSE results are shown alongside the 16-thread cases (since it runs on a GPU, it is necessarily a parallel implementation). The bar chart gives the time in seconds for the optimization call plus 100 execution calls; note the logarithmic scale. For a given thread count, the bars for each library are paired: the optimization call on the left, in a slightly lighter shade, and the execution call on the right. With 1 thread, AOCL and Arm PL return immediately from the optimization call (for Arm PL, for the reason given above), so those bars do not appear on the chart. All benchmark results were verified by checking that they produce a suitably small residual, |αy − Ax|.

[Bar chart: performance of sparse triangular solve for a selection of matrices and vendor libraries]

The details of the matrices are as follows:

Matrix             Dimension (N)   Non-zeros (NNZ)
Fault_639                638,802        14,626,683
Hook_1498              1,498,023        31,207,734
PFlow_742                742,793        18,940,627
Queen_4147             4,147,110       166,823,197
dielFilterV2real       1,157,456        24,848,204
dielFilterV3real       1,102,824        45,204,422

The results show that Arm PL running on Graviton3 is always competitive with other vendor libraries running on native hardware, and in some cases the best performance is on Arm.

Performing a sparse triangular matrix solve tends to be a memory-bound operation for matrices such as these, whose memory requirements exceed the capacity of the fast levels of data cache. This is especially true when using a single core, and the single-threaded cases perform particularly well on Graviton3. The degree of parallelism available depends on the structure of the sparse matrix, so the benefit of using multiple threads varies from one matrix to another. Nevertheless, the performance of Arm PL with 8 and 16 threads on Graviton3 very closely matches that of MKL on Xeon, for both the optimization and execution calls. These results also suggest that sparse triangular solves are perhaps better suited to multicore CPUs than to GPUs: with 16 threads, both Arm PL and MKL outperform cuSPARSE.

Sparse Vector Support

Three types of operations involving sparse vectors are supported in Arm PL 23.04: plane rotations, dot products and "axpy"; in each case, the operation is on a vector in sparse format and a vector in dense format. We provide some specializations of each of these operations:

  • Plane rotations functions:
    • fp32 and fp64 real vectors;
    • fp32 and fp64 complex vectors where the sine component is a complex variable (matching e.g. LAPACK ZROT);
    • fp32 and fp64 complex vectors where the sine component is a real variable (matching e.g. BLAS ZDROT).
  • Dot product functions, where x is a sparse vector and y is a dense vector:
    • result = xᵀ·y for fp32 and fp64 real vectors;
    • result = xᵀ·y for fp32 and fp64 complex vectors;
    • result = xᴴ·y for fp32 and fp64 complex vectors, where the sparse vector is conjugated.
  • axpy functions, where x is a sparse vector, w and y are dense vectors, and α and β are scalars: 
    • y = αx + βy for fp32 and fp64 real vectors, and fp32 and fp64 complex vectors;
    • w = αx + βy (note separate result vector w) for fp32 and fp64 real vectors and fp32 and fp64 complex vectors.

In each case, operations are only carried out on the elements which correspond to non-zero values in the sparse vector. For example, plane rotation is defined as:

for (i=0; i<nnz; i++) {
    temp = c*x[i] + s*y[index[i]];
    y[index[i]] = c*y[index[i]] - s*x[i];
    x[i] = temp;
}
  
Where x holds the non-zero values of the sparse vector, y is the dense vector, index maps each stored element of x to its position in y, and c and s are the cosine and sine components passed to the function, which define the 2x2 rotation matrix [ c  s; -s  c ].

See the Arm PL 23.04 release for C and Fortran examples of using armpl_sprot_exec. Note that vector operations do not give the same scope for optimization as matrix operations (the complexity is just O(nnz)), so the workflow is slightly simpler: create a sparse vector variable (armpl_spvec_t) using armpl_spvec_create, then perform the operation using the appropriate execution function (e.g. armpl_sprot_exec). To access the details of the resulting sparse vector, export it using armpl_spvec_export, and finally destroy the sparse vector variable once it is no longer needed, using armpl_spvec_destroy.

Arm PL offers some flexibility in this workflow through extra utility functions. For example, if you already have a dense format representation of a vector (i.e. one explicitly storing zeros), an alternative way to create an armpl_spvec_t is to "gather" the non-zero elements into a sparse vector using armpl_spvec_gather (see the code snippet below). To complement this we also provide armpl_spvec_scatter, which populates a dense vector with the elements of a sparse vector; scatter can be used as an alternative to armpl_spvec_export in the workflow described above. Sometimes it is necessary to update the values in a sparse vector without changing its sparse structure; armpl_spvec_update does this without creating a new variable. The sparse vector export and update utilities match those already provided for sparse matrices in Arm PL.

// y is a dense format vector (double *y) containing many zeros, which we
// want to turn into a sparse format vector x before doing a plane rotation
// involving another dense format vector z

...

// Gather y into a new sparse vector variable x
armpl_spvec_t x;
armpl_status_t info = armpl_spvec_gather(y, 0, n, &x, 0);
if (info!=ARMPL_STATUS_SUCCESS) { armpl_spvec_print_err(x); return (int)info; }

// Perform the plane rotation
info = armpl_sprot_exec(x, &z[0], c, s);
if (info!=ARMPL_STATUS_SUCCESS) { armpl_spvec_print_err(x); return (int)info; }

...

Summary

With the new functions introduced in this blog, Arm PL has taken another step towards providing numerical software developers with the same optimized sparse linear algebra functionality that can be found in libraries optimized for other architectures. And we have shown that performance is very competitive on Arm. We look forward to hearing feedback on these new functions, and also hearing about more sparse functionality which developers want to see performing well on Arm in the future.
