With growing momentum behind a new community-driven effort to define a modern standard, Arm has added new sparse linear algebra functions in the latest release of Arm Performance Libraries, 25.07.
In this blog, we introduce the new sparse functions added in Arm Performance Libraries (ArmPL) and explain how they align with an emerging cross-vendor effort to build an API standard in collaboration with academic and industry peers. We also take a closer look at one of the most performance-critical additions, sampled dense-dense matrix multiplication (SDDMM), and share performance insights based on benchmarks run on Arm Neoverse-based systems.
The Challenge of Standardizing Sparse Linear Algebra APIs
Optimized sparse linear algebra libraries have always lacked a standardized API that is widely adopted by vendor maths library implementors or numerical application developers. The use and development of sparse libraries is complicated by the fact that performance depends very heavily on the structure (or lack thereof) of the sparse matrices involved. Unlike in dense linear algebra, there are many different storage formats and algorithms to choose from, so having to juggle multiple interfaces in a multi-platform application only adds to the complexity. This is a problem, because sparse matrix operations underpin many important applications in science and engineering (e.g. solving large systems of differential equations) as well as in Machine Learning (ML) (e.g. performing factor analysis).
Motivated by a fresh community effort to define a modern standard, we have added new sparse functions into the latest release of Arm Performance Libraries. I had the pleasure of attending three workshops at the University of Tennessee where collaborators from academia and industry discussed the requirements, scope and design choices for the new standard. The workshops have been supported by weekly meetings in which we have developed a C++ reference implementation proposal on GitHub, as well as a position paper [1] to capture our thoughts. Representatives from industry include other vendor maths library developers from Intel, AMD and NVIDIA. Currently, all vendor libraries provide different interfaces for matching sparse linear algebra operations.
There have been past attempts to standardize APIs in this area, but they failed to gain traction for various reasons. This recent effort recognizes the limitations and pitfalls encountered in the past, as well as changes in computing such as the prevalence of heterogeneous systems featuring CPUs and GPUs, which necessitate a different approach. Additionally, there remains an appetite from users and vendor maths library developers to work with standard interfaces to aid collaboration and application portability.
A key concern of the reference implementation proposal has been vendor library integration: how easily can we wrap existing libraries behind the new interfaces? ArmPL has been included in the proofs-of-concept by wrapping our existing inspector-executor style API behind prototype interfaces for the standard.
One of the more settled aspects of the position paper is the list of functionality on which we've agreed to focus for the initial iteration of the standard. Based on the list in the paper, we have added some of the missing functionality in ArmPL 25.07. The new functions have been introduced in the style of ArmPL's existing inspector-executor API, ready to be wrapped into an implementation of the emerging standard. The list covers the following operations:
| Operation | Already in ArmPL before 25.07 | New in 25.07 |
| --- | --- | --- |
| Scale | | armpl_spscale_exec_* |
| Transpose | | armpl_sptranspose_exec_* |
| Norm | | armpl_spnorm_exec_* |
| Elementwise multiply | | armpl_spelmm_optimize, armpl_spelmm_exec_* |
| Add | armpl_spadd_optimize, armpl_spadd_exec_* | |
| Matrix-matrix multiply | armpl_spmm_optimize, armpl_spmm_exec_* | |
| Matrix-vector multiply | armpl_spmv_optimize, armpl_spmv_exec_* | |
| Triangular solve | armpl_spsv_optimize, armpl_spsv_exec_* | |
| Export | armpl_export_<fmt>_* | |
| Predicate selection | | Not yet implemented |
| SDDMM | | armpl_sddmm_optimize, armpl_sddmm_exec_* |
Entries in the second column indicate functionality ArmPL already covered prior to the 25.07 release; the third column shows the new support added in 25.07. All execution (exec) functions are supported for single- and double-precision, real and complex floating-point types. The new functions are fully documented in the online ArmPL Reference Guide, and examples are provided in the release. Note that only predicate selection has not yet been implemented in ArmPL, because there is still a lack of detail in the proposals about exactly how predicates are to be passed into a library with a C (rather than C++) interface.
The first three of the new functions are relatively simple. The sparse matrix scaling (A := αA) and transpose (A := Aᵀ or A := A*) functions operate in place on an armpl_spmat_t matrix, and users can select whether the matrix norm function returns the infinity norm or the Frobenius norm of a given matrix. These operations are not expected to benefit from inspection before execution, so there is no corresponding optimize function. In contrast, the functions for elementwise matrix multiplication (C := α(op(A) ⊙ op(B)) + βC, where ⊙ denotes element-by-element multiplication of two matrices) and sampled dense-dense matrix multiplication (SDDMM; see below for the definition) can benefit from prior inspection, since performance depends on the structure of the input sparse matrices.
Sampled dense-dense matrix multiplication (SDDMM) performance
Of the newly added functions, SDDMM is probably the most performance-critical, particularly for ML applications [2]. This operation is defined as follows:
C := α(op(A) · op(B)) ⊙ spy(C) + βC
where A and B are dense matrices and C is a sparse matrix; α and β are scalar multipliers; op(A) is one of A, Aᵀ or A* (i.e. the matrix, its transpose or its conjugate transpose); ⊙ denotes elementwise matrix multiplication as above; and spy(C) is a matrix with the same sparsity pattern as C, but with all non-zero values replaced by 1.
In other words, this is equivalent to performing a dense general matrix-matrix multiplication (GEMM) and then selecting and storing only the elements which correspond to the non-zero values in the sparse matrix C.
The sparse matrices used in ML workloads are typically much less sparse than those seen in traditional High Performance Computing (HPC), i.e. sparse ML matrices have a higher ratio of non-zero values. For ML applications, [3] highlights that matrix sparsity is around 70-90%, in contrast with the >99% sparsity common in HPC.
The easiest way to implement this function is to perform a dense GEMM using ArmPL and then accumulate the values corresponding to the non-zero positions of the sparse matrix C back into C. The graph below compares the performance of ArmPL's SDDMM with this "GEMM + selection" approach for some of the problem dimensions cited in [3]. The comparison runs used the single-precision real functions (i.e. armpl_sddmm_exec_s and SGEMM) on an NVIDIA Grace system with 144 Arm Neoverse V2 cores. Both the SDDMM and SGEMM functions are optimized with OpenMP multithreading in ArmPL, and we did not constrain the number of threads (i.e. the library was free to use up to the maximum of 144 on this system). Note that ArmPL automatically uses fewer threads than the maximum available when doing so is better for performance.
As expected, once sparsity increases into the domain of HPC matrices (>99% sparse) the new function is many times faster than "GEMM + selection", but the results also show a performance benefit from using ArmPL at sparsities below 99%, i.e. in the range typical of matrices from ML applications.
Future
We look forward to collaborating more on the emerging standard and being in a position to adopt the new interfaces in ArmPL once the standard is agreed upon. In the meantime, we will try to incorporate the new functionality as part of vendor integration proof-of-concept efforts, and also adapt ArmPL itself to make integration easier. Adding this new functionality in ArmPL 25.07 puts us in a better position to consider how to approach future integration.
References