With growing momentum behind a new community-driven effort to define a modern standard, Arm has added new sparse linear algebra functions in the latest release of Arm Performance Libraries, 25.07.
In this blog, we introduce the new sparse functions added in Arm Performance Libraries (ArmPL) and explain how they align with an emerging cross-vendor effort to build an API standard in collaboration with academic and industry peers. We also take a closer look at one of the most performance-critical additions, sampled dense-dense matrix multiplication (SDDMM), and share performance insights based on benchmarks run on Arm Neoverse-based systems.
The Challenge of Standardizing Sparse Linear Algebra APIs
Optimized sparse linear algebra libraries have always lacked a standardized API that is widely adopted by vendor maths library implementors or numerical application developers. The use and development of sparse libraries is complicated by the fact that performance depends very heavily on the structure (or lack thereof) of the sparse matrices involved. Unlike in dense linear algebra, there are many different storage formats and algorithms to choose from, so having to juggle multiple interfaces in a multi-platform application only adds to the complexity. This is a problem, because sparse matrix operations underpin many important applications in science and engineering (e.g. solving large systems of differential equations) as well as in Machine Learning (ML) (e.g. performing factor analysis).
Motivated by a fresh community effort to define a modern standard, we have added new sparse functions into the latest release of Arm Performance Libraries. I had the pleasure of attending three workshops at the University of Tennessee where collaborators from academia and industry discussed the requirements, scope and design choices for the new standard. The workshops have been supported by weekly meetings in which we have developed a C++ reference implementation proposal on GitHub, as well as a position paper [1] to capture our thoughts. Representatives from industry include other vendor maths library developers from Intel, AMD and NVIDIA. Currently, all vendor libraries provide different interfaces for matching sparse linear algebra operations.
There have been past attempts to standardize APIs in this area, but they failed to gain traction for various reasons. This recent effort recognizes the limitations and pitfalls encountered in the past, as well as changes in computing such as the prevalence of heterogeneous systems featuring CPUs and GPUs, which necessitate a different approach. Additionally, there remains an appetite from users and vendor maths library developers to work with standard interfaces to aid collaboration and application portability.
A key concern of the reference implementation proposal has been vendor library integration: how easily can we wrap existing libraries behind the new interfaces? ArmPL has been included in the proofs-of-concept by wrapping our existing inspector-executor style API behind prototype interfaces for the standard.
One of the more settled aspects of the position paper is the list of functionality on which we've agreed to focus for the initial iteration of the standard. Based on the list in the paper, we have added some of the missing functionality in ArmPL 25.07. The new functions have been introduced in the style of ArmPL's existing inspector-executor API, ready to be wrapped into an implementation of the emerging standard. The list covers the following operations:
| Operation | Already in ArmPL before 25.07 | New in 25.07 |
| --- | --- | --- |
| Scale | | armpl_spscale_exec_* |
| Transpose | | armpl_sptranspose_exec_* |
| Norm | | armpl_spnorm_exec_* |
| Elementwise multiply | | armpl_spelmm_optimize, armpl_spelmm_exec_* |
| Add | armpl_spadd_optimize, armpl_spadd_exec_* | |
| Matrix-matrix multiply | armpl_spmm_optimize, armpl_spmm_exec_* | |
| Matrix-vector multiply | armpl_spmv_optimize, armpl_spmv_exec_* | |
| Triangular solve | armpl_spsv_optimize, armpl_spsv_exec_* | |
| Export | armpl_export_<fmt>_* | |
| Predicate selection | | Not yet implemented |
| SDDMM | | armpl_sddmm_optimize, armpl_sddmm_exec_* |
Entries in the second column indicate functionality ArmPL already covered prior to the 25.07 release; the third column shows the new support added in 25.07. All execution (exec) functions are supported for single- and double-precision, real and complex floating-point types. The new functions are fully documented in the online ArmPL Reference Guide, and examples are provided in the release. Note that only predicate selection has not yet been implemented in ArmPL, because there is still a lack of detail in the proposals about exactly how predicates are to be passed into a library with a C (rather than C++) interface.
The first three of the new functions are relatively simple. The sparse matrix scaling (A := αA) and transpose (A := Aᵀ or A := A*) functions operate in place on an armpl_spmat_t matrix, and users can select whether the matrix norm function returns the infinity norm or the Frobenius norm of a given matrix. These operations are not expected to benefit from inspection before execution, so there is no corresponding optimize function. In contrast, the functions for elementwise matrix multiplication (C := α(op(A) ⊙ op(B)) + βC, where ⊙ denotes element-by-element multiplication of two matrices) and sampled dense-dense matrix multiplication (SDDMM; see below for the definition) can benefit from prior inspection, since performance depends on the structure of the input sparse matrices.
Sampled dense-dense matrix multiplication (SDDMM) performance
Of the newly added functions, SDDMM is probably the most performance-critical, particularly for ML applications [2]. This operation is defined as follows:
C := α(op(A) · op(B)) ⊙ spy(C) + βC
where A and B are dense matrices and C is a sparse matrix; α and β are scalar multipliers; op(A) is one of A, Aᵀ or A* (i.e. the matrix, its transpose or its conjugate transpose); ⊙ denotes elementwise matrix multiplication as above; and spy(C) is a matrix with the same sparsity pattern as C, but with all non-zero values replaced by 1.
In other words, this is equivalent to performing a dense general matrix-matrix multiplication (GEMM) and then selecting and storing only the elements which correspond to the non-zero values in the sparse matrix C.
The sparse matrices used in ML workloads are typically much less sparse than those seen in traditional High Performance Computing (HPC), i.e. sparse ML matrices have a higher ratio of non-zero values. For ML applications, [3] highlights that matrix sparsity is around 70-90%, in contrast with the >99% sparsity common in HPC.
The easiest way to implement this function is to perform a dense GEMM using ArmPL and then accumulate the values corresponding to the non-zero positions of the sparse matrix C back into C. The graph below compares the performance of ArmPL's SDDMM with this "GEMM + selection" approach for some of the problem dimensions cited in [3]. The comparison runs used the single-precision real functions (i.e. armpl_sddmm_exec_s and SGEMM) on an NVIDIA Grace system with 144 Arm Neoverse V2 cores. Both the SDDMM and SGEMM functions are optimized with OpenMP multithreading in ArmPL, and we did not constrain the number of threads (i.e. the library was free to use up to the maximum of 144 on this system). Note that ArmPL automatically uses fewer threads than the maximum available when doing so is better for performance.
As expected, once sparsity increases into the domain of HPC matrices (>99% sparse) the new function is many times faster than "GEMM + selection", but the results also show a performance benefit from using ArmPL at sparsities below 99%, i.e. in the range typical of matrices from ML applications.
Future
We look forward to collaborating more on the emerging standard and being in a position to adopt the new interfaces in ArmPL once the standard is agreed upon. In the meantime, we will try to incorporate the new functionality as part of vendor integration proof-of-concept efforts, and also adapt ArmPL itself to make integration easier. Adding this new functionality in ArmPL 25.07 puts us in a better position to consider how to approach future integration.
References