Arm Performance Libraries 24.10

Chris Goodyer
November 11, 2024
5 minute read time.

Arm Performance Libraries (Arm PL) provides optimized standard core math libraries for numerical applications on 64-bit Arm (AArch64) processors. The BLAS, LAPACK, FFT, and sparse routines are built with OpenMP parallelism to maximize performance in multi-processor environments. High-performance random number generation and scalar and vector math.h routines are also included.

One of the ways to get Arm PL for Linux is to download the latest version of Arm Compiler for Linux (ACfL). The ACfL 24.10 release is updated to LLVM 19, which includes code quality improvements along with improved support for function multiversioning. For full details of the improvements in LLVM 19, read our LLVM 19 blog post, and to understand how function multiversioning can help you leverage new Arm architecture features, see our function multiversioning learning path.

In addition to the version of Arm PL contained in ACfL, the libraries are also available on their own for Linux, macOS and Windows. These standalone versions of Arm PL are available for download here. The standalone versions for Linux are compatible with GCC, NVHPC, and for the first time we have made a beta release version of Arm PL using LLVM available. The LLVM version is compatible with the LLVM clang C and flang Fortran compilers and the libomp OpenMP shared memory parallel runtime.

Arm PL 24.10 features many performance improvements across its constituent components. In this blog post, we highlight some of these for parallel matrix-matrix multiplication (GEMM), Fast Fourier Transforms (FFTs), Mersenne Twister random number skip-ahead, and vectorized trigonometric functions. Other changes, including the update to LAPACK 3.12.0, are called out in the full release notes.

Matrix-matrix multiplication performance improvements

The performance of highly parallel single and double-precision real matrix-matrix multiplication has been improved in Arm PL 24.10 for Arm Neoverse V2 systems such as NVIDIA Grace and AWS Graviton4. The libraries are now competitive with the NVIDIA Performance Libraries (NVPL) on Grace systems for multi-threaded problems using full sockets. The results below compare the performance of Arm PL 24.10, NVPL 24.7 and OpenBLAS (built from commit 6a60eb1 using TARGET=NEOVERSEV1). Arm PL now matches NVPL performance, achieving over 80% of theoretical peak for the largest problems when using 72 cores (all cores in a single socket).

The inset graphs show that performance on small problems (plotted with a logarithmic y-axis) is on average best with Arm PL. This is due to "thread throttling": the library avoids using too many threads on smaller problems, where the extra threads would degrade performance. Thread throttling benefits real applications, which often solve small problems as well as large ones while running with a high thread count, e.g. with OMP_NUM_THREADS set to the number of available cores. Thread throttling is used by all BLAS functions within Arm PL.
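The thread-throttling idea can be sketched as a simple work-per-thread cap. Everything below is hypothetical: Arm PL's real heuristic is internal and more sophisticated, and the threshold constant is an assumption made up for illustration.

```c
#include <assert.h>

/* Hypothetical sketch of "thread throttling": cap the number of OpenMP
 * threads used for a GEMM call so each thread still has enough work.
 * The threshold is an illustrative assumption, not Arm PL's heuristic. */
#define MIN_FLOPS_PER_THREAD (1ull << 22)  /* ~4 MFLOPs per thread, assumed */

static int throttled_threads(int m, int n, int k, int max_threads) {
    unsigned long long flops = 2ull * m * n * k;  /* GEMM flop count */
    unsigned long long t = flops / MIN_FLOPS_PER_THREAD;
    if (t < 1) t = 1;                             /* always at least one */
    if (t > (unsigned long long)max_threads) t = max_threads;
    return (int)t;
}
```

With this heuristic a 64x64x64 multiply runs on a single thread even when OMP_NUM_THREADS is 72, while a 4096x4096x4096 problem uses all 72 threads.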

Arm PL 24.10, NVPL and OpenBLAS DGEMM performance using 72 threads

Fast Fourier Transform improvements

FFTs in Arm PL have also been optimized in the 24.10 release to deliver best-in-class performance for large 1-d problems. Prior to 24.10, FFT problems were only executed in parallel for batched or multi-dimensional cases. From 24.10 we have enabled parallelism for 1-d problems, which benefits long transform lengths. The graphs below show the performance of two large problems using up to 20 threads on an NVIDIA Grace system with Arm PL 24.10, FFTW 3.3.10 and NVPL 24.7. The transform lengths are 12! = 479001600 and 10^9. Parallelism for 1-d problems in Arm PL is implemented across the factors of the transform length which, for small enough factors, correspond to different compute kernels. Choosing 12! as a benchmark case is therefore interesting because it involves a variety of kernels being executed in parallel. The results show that Arm PL has the best serial (single-thread) performance, and this advantage persists up to 20 threads, where performance plateaus for all libraries.

Mersenne Twister Skip-Ahead performance improvements

MT19937 skip-ahead in Arm PL 24.10 is around 100 times faster than in Arm PL 24.04. MT19937 is a pseudo-random number generator that implements a Mersenne Twister with a period of 2^19937 - 1. For applications relying on pseudo-random number generation that wish to scale to multiple threads, MT19937 skip-ahead enables each thread to generate a unique and reproducible pseudo-random number sequence. The graph on the left shows the time taken to skip ahead by a power-of-2 number of elements; 24.10 is around 100 times faster than 24.04. The graph on the right shows the improvement for skip-ahead distances that are one less than a power of 2. These values, which have a large number of bits set in their binary representation, benefit the least from the recent optimizations, but still see around a 10 times performance improvement.
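MT19937 skip-ahead itself is implemented with polynomial arithmetic over GF(2), but the underlying divide-and-conquer idea is easier to see with a toy generator. As a hedged illustration (not Arm PL's algorithm), here is O(log n) skip-ahead for a linear congruential generator, whose step x -> a*x + c (mod 2^64) is an affine map that can be composed with itself by repeated squaring. The constants are Knuth's MMIX LCG parameters.

```c
#include <stdint.h>

/* Toy illustration of O(log n) skip-ahead using an LCG, not MT19937.
 * One step is the affine map x -> A*x + C (mod 2^64); composing the map
 * with itself by repeated squaring jumps n steps in O(log n) time. */
static const uint64_t A = 6364136223846793005ULL;  /* MMIX multiplier */
static const uint64_t C = 1442695040888963407ULL;  /* MMIX increment  */

static uint64_t lcg_step(uint64_t x) { return A * x + C; }

/* Jump the generator ahead n steps in O(log n) multiplications. */
static uint64_t lcg_skip(uint64_t x, uint64_t n) {
    uint64_t acc_a = 1, acc_c = 0;   /* accumulated map: identity     */
    uint64_t cur_a = A, cur_c = C;   /* current map: one step         */
    while (n) {
        if (n & 1) {                 /* acc = cur o acc               */
            acc_a = cur_a * acc_a;
            acc_c = cur_a * acc_c + cur_c;
        }
        cur_c = cur_a * cur_c + cur_c;  /* cur = cur o cur            */
        cur_a = cur_a * cur_a;
        n >>= 1;
    }
    return acc_a * x + acc_c;
}
```

In a parallel setting, thread t can start its stream from `lcg_skip(seed_state, t * block_len)` so every thread draws a disjoint, reproducible block of the sequence, which is exactly the role MT19937 skip-ahead plays in Arm PL.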

MT19937 skip-ahead improvement in 24.10 for power-of-2 sizes (left) and for sizes one less than a power of 2 (right)

The source code for the MT19937 skip-ahead performance improvements has been made publicly available in OpenRNG 24.10: https://gitlab.arm.com/libraries/openrng.

Libamath performance improvements

Arm PL 24.10 comes with a new macOS-specific build of libamath, containing optimized scalar math routines.

The Linux build gains optimized single and double-precision Neon and SVE versions of modf and sincospi. The modf and modff algorithms are exact and produce no rounding error, while sincospi and sincospif are accurate to 3.2 ULP. On Neon targets this gives around a 4x speedup for sincospi(f), 3.5x for modf and 1.5x for modff. On SVE targets it gives a 2.5x speedup for sincospif, 2x for sincospi and modff, and 1.5x for modf.

Conclusions

Arm Performance Libraries 24.10 is out now, available for Linux, macOS and Windows. The libraries can be used to accelerate a wide variety of workloads, including engineering, scientific and machine learning applications. Let us know in the forum if you have any questions.
