Arm Performance Libraries (Arm PL) provides optimized standard core math libraries for numerical applications on 64-bit Arm (AArch64) processors. The BLAS, LAPACK, FFT, and sparse routines are built with OpenMP parallelism to maximize performance in multi-processor environments. High-performance random number generation and scalar and vector math.h routines are also included.
One of the ways to get Arm PL for Linux is to download the latest version of Arm Compiler for Linux (ACfL). The ACfL 24.10 release is updated to LLVM 19, which includes code quality improvements along with improved support for function multiversioning. For full details of the improvements in LLVM 19, please read our LLVM 19 blog, and to understand how function multiversioning can help you leverage new Arm architecture features, see our function multiversioning learning path.
In addition to the version of Arm PL contained in ACfL, the libraries are also available on their own for Linux, macOS and Windows. These standalone versions of Arm PL are available for download here. The standalone Linux versions are compatible with GCC and NVHPC, and for the first time we have also made available a beta version of Arm PL built with LLVM. The LLVM version is compatible with the LLVM clang C and flang Fortran compilers and with the libomp OpenMP shared-memory parallel runtime.
Arm PL 24.10 features many performance improvements across its constituent components. In this blog we highlight some of these for parallel matrix-matrix multiplication (GEMM), Fast Fourier Transforms (FFTs), Mersenne Twister random number skip-ahead, and vectorized trigonometric functions. Other changes, including the update to LAPACK 3.12.0, are called out in the full release notes.
Matrix-matrix multiplication performance improvements
The performance of highly parallel single and double precision real matrix-matrix multiplication has been improved in Arm PL 24.10 for Arm Neoverse V2 systems such as NVIDIA Grace and AWS Graviton4. The libraries are now competitive with the NVIDIA Performance Libraries (NVPL) on Grace systems for multi-threaded problems using full sockets. The results below compare the performance of Arm PL 24.10, NVPL 24.7 and OpenBLAS (built from commit 6a60eb1 with TARGET=NEOVERSEV1). Arm PL now matches NVPL performance, achieving over 80% of theoretical peak for the largest problems when using 72 cores (all the cores in a single socket). The inset graphs, plotted with a logarithmic y-axis, show that for small problems Arm PL is on average the fastest. This can be attributed to thread throttling: the library deliberately uses fewer threads for smaller problems, where running on every available core would degrade performance. Thread throttling benefits real applications, which often make library calls on small problems as well as large ones while running with a high thread count, e.g. with OMP_NUM_THREADS set to the number of cores available. Thread throttling is used by all BLAS functions within Arm PL.
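For illustration, a DGEMM call through Arm PL looks like the minimal sketch below. The armpl.h header and -larmpl link flag follow Arm PL's documented conventions; the OpenMP build takes its thread count from OMP_NUM_THREADS and, per the thread throttling described above, may use fewer threads than requested for small problems.

```c
/* Minimal sketch: double-precision GEMM through Arm PL's CBLAS
 * interface. Build with something like: gcc -O2 gemm.c -larmpl -fopenmp */
#include <stdio.h>
#include <stdlib.h>
#include <armpl.h>   /* CBLAS prototypes shipped with Arm PL */

int main(void)
{
    const int n = 1000;                     /* square problem: C = A * B */
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C := 1.0 * A * B + 0.0 * C, row-major storage; Arm PL decides
     * internally how many OpenMP threads are worth using for this size */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);            /* expect 2.0 * n */
    free(A); free(B); free(C);
    return 0;
}
```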
Fast Fourier Transform improvements
FFTs in Arm PL have also been optimized in the 24.10 release to deliver best-in-class performance for large 1-d problems. Prior to 24.10, FFT problems were only executed in parallel in the batched or multi-dimensional cases. From 24.10 we have enabled parallelism for 1-d problems as well, which benefits long transform sizes. The graphs below show the performance of two large problems using up to 20 threads on an NVIDIA Grace system with Arm PL 24.10, FFTW 3.3.10 and NVPL 24.7. The transform lengths are 12! = 479001600 and 10^9. Parallelism for 1-d problems in Arm PL is implemented across the factors of the transform length, which, for small enough factors, correspond to different compute kernels. Choosing 12! as a benchmark case is therefore interesting because it involves a variety of kernels being executed in parallel. The results show that Arm PL has the best serial (one-thread) performance, and this advantage is maintained up to 20 threads, where performance plateaus for all libraries.
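As a sketch of how an application exercises this, the snippet below plans and executes a 1-d transform of length 12! through an FFTW3-style interface. The fftw3.h header name and the automatic internal parallelism are assumptions based on Arm PL's FFTW compatibility, and note that two complex arrays at this length need several gigabytes each.

```c
/* Hedged sketch: a large 1-d complex-to-complex transform via the
 * FFTW3-style interface that Arm PL provides. With the OpenMP build
 * of Arm PL 24.10, a 1-d transform of this size can be executed in
 * parallel across the factors of its length. */
#include <stdlib.h>
#include <fftw3.h>

int main(void)
{
    const int n = 479001600;                 /* 12!, the benchmark length */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
    if (!in || !out) return 1;               /* ~7.7 GB per array */
    for (int i = 0; i < n; i++) { in[i][0] = 1.0; in[i][1] = 0.0; }

    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);                          /* parallelized internally */

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```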
Mersenne Twister Skip-Ahead performance improvements
MT19937 skip-ahead in Arm PL 24.10 has seen a 100-fold performance improvement over Arm PL 24.04. MT19937 is a pseudo-random number generator that implements a Mersenne Twister with a period of 2^19937 − 1. For applications that rely on pseudo-random number generation and wish to scale to multiple threads, MT19937 skip-ahead enables each thread to generate a unique and reproducible pseudo-random number sequence. The graph on the left shows the time taken to skip ahead by a number of elements that is a power of 2; here 24.10 is around 100 times faster than 24.04. The graph on the right shows the improvement for skip-ahead values that are one less than a power of 2. These values, which have a large number of bits set in their binary representation, benefit the least from the recent optimizations, but still see around a 10 times performance improvement.
The source code for the MT19937 skip-ahead performance improvements has been made publicly available in OpenRNG 24.10: https://gitlab.arm.com/libraries/openrng.
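A hedged sketch of the skip-ahead pattern is shown below. It assumes the VSL-style interface that OpenRNG targets (the openrng.h header name and exact prototypes are assumptions); the idea is simply that every thread seeds the same generator and skips ahead by thread_id * block_size, so the per-thread blocks partition one reproducible global sequence without overlap.

```c
/* Hedged sketch: per-thread reproducible streams via MT19937 skip-ahead,
 * written against a VSL-style API as provided by OpenRNG (header name
 * and prototypes are assumptions). */
#include <openrng.h>

#define N_PER_THREAD 1000000

/* Fill one thread's block of uniform doubles in [0, 1). Each thread
 * calls this with its own id; skip-ahead guarantees the blocks are
 * disjoint slices of the same underlying MT19937 sequence. */
void fill_block(int thread_id, double *out)
{
    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_MT19937, 42);          /* common seed */
    vslSkipAheadStream(stream, (long long)thread_id * N_PER_THREAD);
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream,
                 N_PER_THREAD, out, 0.0, 1.0);
    vslDeleteStream(&stream);
}
```

Called from, say, an OpenMP parallel region with one invocation per thread, this produces the same overall sequence regardless of thread count, which is what makes the results reproducible.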
Libamath performance improvements
Arm PL 24.10 comes with a new macOS-specific build of libamath, containing optimized scalar math routines.
The new Linux build offers optimized single and double precision Neon and SVE versions of modf and sincospi. The modf and modff algorithms are exact and incur no rounding error, while sincospi and sincospif are accurate to 3.2 ULP. On Neon targets this gives a speedup of about 4x for sincospi(f), 3.5x for modf and 1.5x for modff. On SVE targets it gives a speedup of 2.5x for sincospif, 2x for sincospi and modff, and 1.5x for modf.
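As a quick sketch of calling these routines: modf is the standard math.h function, while the sincospi prototype below is an assumption following the common void sincospi(double, double *, double *) convention; linking with -lamath ahead of the system libm picks up the optimized implementations.

```c
/* Minimal sketch of the libamath scalar routines.
 * Build with something like: gcc -O2 amath.c -lamath -lm */
#include <stdio.h>
#include <math.h>

void sincospi(double x, double *s, double *c);   /* assumed prototype */

int main(void)
{
    double ip;
    double frac = modf(3.75, &ip);       /* exact: frac = 0.75, ip = 3.0 */

    double s, c;
    sincospi(0.5, &s, &c);               /* sin(pi/2) = 1, cos(pi/2) = 0 */

    printf("modf: %f + %f, sincospi(0.5): %f, %f\n", ip, frac, s, c);
    return 0;
}
```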
Conclusions
The latest release of Arm Performance Libraries, 24.10, is out now, available for Linux, macOS and Windows. It can be used to accelerate a wide variety of workloads, including engineering, scientific and machine learning applications. Please let us know via the forum if you have any questions.