
Introducing New Sparse Functions in Arm Performance Libraries 25.07

Chris Armstrong
July 16, 2025
6 minute read time.

With growing momentum behind a new community-driven effort to define a modern standard, Arm has added new sparse linear algebra functions in the latest release of Arm Performance Libraries 25.07.

In this blog, we introduce the new sparse functions added in Arm Performance Libraries (ArmPL) and explain how they align with an emerging cross-vendor effort to build an API standard in collaboration with academic and industry peers. We also take a closer look at one of the most performance-critical additions, sampled dense-dense matrix multiplication (SDDMM), and share performance insights based on benchmarks run on Arm Neoverse-based systems.

The Challenge of Standardizing Sparse Linear Algebra APIs

Optimized sparse linear algebra libraries have long lacked a standardized API that is widely adopted by vendor maths library implementors or numerical application developers. The use and development of sparse libraries is complicated by the fact that performance depends heavily on the structure (or lack thereof) of the sparse matrices involved. Unlike in dense linear algebra, there are many different storage formats and algorithms to choose from, so having many interfaces to work with in a multi-platform application only adds to the complexity. This is a problem, because sparse matrix operations underpin many important applications in science and engineering (e.g. solving large systems of differential equations) as well as in Machine Learning (ML) (e.g. performing factor analysis).
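To make the format question concrete, the sketch below builds the compressed sparse row (CSR) representation of a small matrix in plain C. CSR is just one common choice alongside COO, CSC and various blocked layouts, and which format performs best depends on the matrix structure; the array names here are purely illustrative.

```c
#include <stdio.h>

/* CSR stores only the non-zero values of a sparse matrix, here the
 * 3x4 matrix:
 *   [ 5 0 0 2 ]
 *   [ 0 0 3 0 ]
 *   [ 1 0 0 4 ]
 */
int main(void) {
    double vals[]    = {5.0, 2.0, 3.0, 1.0, 4.0}; /* non-zeros, row by row */
    int    col_idx[] = {0, 3, 2, 0, 3};           /* column of each value  */
    int    row_ptr[] = {0, 2, 3, 5};              /* row i occupies entries
                                                     row_ptr[i]..row_ptr[i+1]-1 */
    for (int i = 0; i < 3; i++)
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
            printf("A(%d,%d) = %g\n", i, col_idx[p], vals[p]);
    return 0;
}
```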

Motivated by a fresh community effort to define a modern standard, we have added new sparse functions to the latest release of Arm Performance Libraries. I had the pleasure of attending three workshops at the University of Tennessee where collaborators from academia and industry discussed the requirements, scope and design choices for the new standard. The workshops have been supported by weekly meetings in which we have developed a C++ reference implementation proposal on GitHub, as well as a position paper [1] to capture our thoughts. Representatives from industry include vendor maths library developers from Intel, AMD and NVIDIA. Currently, every vendor library provides a different interface for the same sparse linear algebra operations.

There have been past attempts to standardize APIs in this area, but they failed to gain traction for various reasons. The recent effort recognizes the limitations and pitfalls encountered in the past, as well as changes in computing, such as the prevalence of heterogeneous systems featuring CPUs and GPUs, which necessitate a different approach. Additionally, there remains an appetite among users and vendor maths library developers for standard interfaces that aid collaboration and application portability.

A key concern of the reference implementation proposal has been vendor library integration: how easily can we wrap existing libraries behind the new interfaces? ArmPL has been included in the proofs-of-concept by wrapping our existing inspector-executor style API behind prototype interfaces for the standard.
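As a flavour of what such wrapping can look like, the sketch below places ArmPL's real armpl_spmv_optimize call behind an invented, standard-style inspection function. The sparse_* names are hypothetical placeholders for illustration only; they are not the interfaces proposed in [1].

```c
#include <armpl.h>

/* Hypothetical standard-style handle that owns an ArmPL sparse matrix.
 * The sparse_* names below are invented for illustration. */
typedef struct {
    armpl_spmat_t mat; /* underlying ArmPL inspector-executor object */
} sparse_matrix_handle_t;

/* Map a standard-style "inspect" phase onto ArmPL's existing
 * SpMV optimize entry point. */
int sparse_spmv_inspect(sparse_matrix_handle_t *h) {
    return (armpl_spmv_optimize(h->mat) == ARMPL_STATUS_SUCCESS) ? 0 : -1;
}
```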

One of the more settled aspects of the position paper is the list of functionality on which we have agreed to focus for the initial iteration of the standard. Based on that list, we have added some of the missing functionality in ArmPL 25.07. The new functions follow the style of ArmPL's existing inspector-executor API, ready to be wrapped into an implementation of the emerging standard. The list contains the following operations:

| Operation | ArmPL 25.04 | ArmPL 25.07 |
|---|---|---|
| Matrix scaling | | armpl_spscale_exec_* |
| Matrix transpose | | armpl_sptranspose_exec_* |
| Matrix norm | | armpl_spnorm_exec_* |
| Elementwise matrix multiplication | | armpl_spelmm_optimize, armpl_spelmm_exec_* |
| Matrix addition | armpl_spadd_optimize, armpl_spadd_exec_* | |
| Matrix-matrix multiplication | armpl_spmm_optimize, armpl_spmm_exec_* | |
| Matrix-vector multiplication | armpl_spmv_optimize, armpl_spmv_exec_* | |
| Triangular solve | armpl_spsv_optimize, armpl_spsv_exec_* | |
| Format conversion | armpl_export_<fmt>_* | |
| Predicate selection | | Not yet implemented |
| Sampled dense-dense multiplication | | armpl_sddmm_optimize, armpl_sddmm_exec_* |

Entries in the second column indicate where ArmPL already had the functionality covered prior to the 25.07 release. The third column shows where we have additionally provided new support in the 25.07 release. All execution (exec) functions are supported for single and double precision, real and complex floating point types. New functions are fully documented in the online ArmPL Reference Guide, and examples are provided in the release. Note that only predicate selection has not yet been implemented in ArmPL. This is because there is still a lack of detail in the proposals about exactly how predicates are to be passed into a library with a C (rather than C++) interface.
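For readers unfamiliar with the inspector-executor style, the sketch below shows the general call pattern using the long-standing SpMV entry points: create a matrix handle, supply usage hints, let the library inspect and optimize, then execute. It follows the flow documented in the ArmPL Reference Guide, but error checking is omitted and the guide should be consulted for exact signatures.

```c
#include <armpl.h>
#include <stdio.h>

int main(void) {
    /* y := 1.0*A*x + 0.0*y for a small 3x3 CSR matrix. */
    armpl_int_t row_ptr[] = {0, 2, 3, 5};
    armpl_int_t col_idx[] = {0, 2, 1, 0, 2};
    double vals[] = {2.0, 1.0, 3.0, 4.0, 5.0};
    double x[] = {1.0, 1.0, 1.0}, y[3] = {0.0, 0.0, 0.0};

    armpl_spmat_t A;
    armpl_spmat_create_csr_d(&A, 3, 3, row_ptr, col_idx, vals, 0);

    /* Inspector: describe the intended use, then let the library
     * analyse the structure and choose an execution strategy. */
    armpl_spmat_hint(A, ARMPL_SPARSE_HINT_SPMV_OPERATION,
                     ARMPL_SPARSE_OPERATION_NOTRANS);
    armpl_spmv_optimize(A);

    /* Executor: run the optimized operation, potentially many times. */
    armpl_spmv_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, 1.0, A, x, 0.0, y);

    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);
    armpl_spmat_destroy(A);
    return 0;
}
```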

The first three of the new functions are relatively simple. The sparse matrix scaling (A := αA) and transpose (A := Aᵀ or A := A*) functions operate in place on an armpl_spmat_t matrix, and users can choose whether the matrix norm function returns the infinity or Frobenius norm of a given matrix. These operations are not expected to benefit from inspection before execution, so there is no corresponding optimize function. In contrast, the functions that perform elementwise matrix multiplication (C := α(op(A) ⊙ op(B)) + βC, where ⊙ denotes element-by-element multiplication of two matrices) and sampled dense-dense matrix multiplication (SDDMM; see below for a definition) can benefit from prior inspection, since performance depends on the structure of the input sparse matrices.

Sampled dense-dense matrix multiplication (SDDMM) performance

Of the newly added functions, SDDMM is probably the most performance-critical, particularly for ML applications [2]. The operation is defined as follows:

C := α(op(A) · op(B)) ⊙ spy(C) + βC

where A and B are dense matrices and C is a sparse matrix; α and β are scalar multipliers; op(A) is one of A, Aᵀ or A* (i.e. the matrix, its transpose or its conjugate transpose); ⊙ denotes elementwise matrix multiplication as above; and spy(C) is a matrix with the same sparsity pattern as C, but with all non-zero values replaced by 1.

In other words, this is equivalent to performing a dense general matrix-matrix multiplication (GEMM) and then selecting and storing only the elements which correspond to the non-zero values in the sparse matrix C.
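To pin down the semantics, here is a plain-C reference sketch (not ArmPL's optimized implementation) that computes SDDMM directly over a CSR sparsity pattern. It ignores op() and uses real single precision for brevity; all names are illustrative.

```c
#include <stddef.h>

/* Reference SDDMM: C := alpha*(A*B) .* spy(C) + beta*C, where A (m x k)
 * and B (k x n) are dense, row-major, and C's pattern is given in CSR
 * form. Only the stored entries of C are computed or updated. */
static void sddmm_ref(int m, int k, int n,
                      float alpha, const float *A, const float *B,
                      float beta, const int *row_ptr, const int *col_idx,
                      float *c_vals) {
    for (int i = 0; i < m; i++) {
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
            int j = col_idx[p];
            float dot = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (int l = 0; l < k; l++)
                dot += A[(size_t)i * k + l] * B[(size_t)l * n + j];
            c_vals[p] = alpha * dot + beta * c_vals[p];
        }
    }
}
```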

The sparse matrices used in ML workloads are typically much less sparse than those seen in traditional High Performance Computing (HPC); that is, sparse ML matrices have a higher ratio of non-zero values. For ML applications, [3] highlights that matrix sparsity is around 70-90%, in contrast with the >99% sparsity common in HPC.

The easiest way to implement this function is to perform a dense GEMM using ArmPL and then accumulate the values corresponding to non-zero positions in the sparse matrix C back into C. The graph below compares the performance of ArmPL's SDDMM with this "GEMM + selection" approach for some of the problem dimensions cited in [3]. The comparison runs used the single precision real functions (i.e. armpl_sddmm_exec_s and SGEMM) on an NVIDIA Grace system with 144 Arm Neoverse V2 cores. Both the SDDMM and SGEMM functions are optimized with OpenMP multithreading in ArmPL, and we did not constrain the number of threads (i.e. the library was free to use up to the maximum of 144 on this system). Note that ArmPL automatically uses fewer threads than the maximum available when doing so is better for performance.

As expected, once sparsity increases into the domain of HPC matrices (>99% sparse), the new function is many times faster than "GEMM + selection", but the results also show a performance benefit from using ArmPL at sparsities below 99%, i.e. in the range typical of matrices from ML applications.
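For reference, the "GEMM + selection" baseline used in that comparison can be sketched as a full dense SGEMM through the standard CBLAS interface, followed by a pass that gathers only the entries at C's non-zero positions. Buffer handling and loop structure here are illustrative, not the exact benchmark code.

```c
#include <cblas.h>
#include <stdlib.h>

/* "GEMM + selection": form the full dense product T := A*B, then keep
 * only the entries matching C's CSR sparsity pattern. */
static void sddmm_gemm_select(int m, int n, int k,
                              float alpha, const float *A, const float *B,
                              float beta, const int *row_ptr,
                              const int *col_idx, float *c_vals) {
    float *T = malloc((size_t)m * n * sizeof *T); /* m x n scratch buffer */
    if (!T) return;
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, T, n);
    for (int i = 0; i < m; i++)        /* selection/accumulation pass */
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
            c_vals[p] = alpha * T[(size_t)i * n + col_idx[p]]
                      + beta * c_vals[p];
    free(T);
}
```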

Future

We look forward to collaborating further on the emerging standard and to adopting the new interfaces in ArmPL once the standard is agreed upon. In the meantime, we will continue to incorporate the new functionality into vendor integration proof-of-concept efforts, and adapt ArmPL itself to make integration easier. Adding this new functionality in ArmPL 25.07 puts us in a better position to consider how to approach future integration.

References

  1. Hartwig Anzt et al., Interface for Sparse Linear Algebra Operations, arXiv preprint arXiv:2411.13259, Nov. 2024. https://arxiv.org/abs/2411.13259
  2. Israt Nisa, Aravind Sukumaran-Rajam, Süreyya Emre Kurt, Changwan Hong, and P. Sadayappan, Sampled Dense Matrix Multiplication for High‑Performance Machine Learning, in Proc. IEEE Int’l Conf. on High Performance Computing (HiPC), 2018, pp. 32–41. https://doi.org/10.1109/HiPC.2018.00013 
  3. Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen, Sparse GPU Kernels for Deep Learning, arXiv preprint arXiv:2006.10901, Aug. 2020. https://arxiv.org/abs/2006.10901