Arm Compiler for Linux (ACfL) provides a complete compiling environment for natively developing and tuning your server and HPC applications on Arm-based platforms. It includes the Arm C/C++ Compiler, the Arm Fortran Compiler, and Arm Performance Libraries.
In this blog, we explore what is new in this first major release of 2023.
The Arm Compiler for Linux roadmap is changing to two major releases of ACfL per year, each with an update to the base LLVM version. This cadence ensures that ACfL keeps up with state-of-the-art LLVM technology, especially for AArch64 and SVE-optimized code generation, and allows users to access the full performance of their Arm-based systems sooner. To signify this change, ACfL releases are now numbered as YY.MM, denoting the year and month of release. This means 23.04 is out now, and 23.10 will be the next release in October, based on LLVM 17. Arm Performance Libraries numbering also follows this convention.
A summary of the key features in this release:
Full release notes can be found on developer.arm.com.
To support complex arithmetic, Arm introduced Armv8.3-A instructions for complex number addition and multiply-accumulate operations. ACfL uses these instructions as a vectorization target at -Ofast when the target system supports them.
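To see the math these instructions implement, a complex multiply-accumulate can be split into the two "rotation" steps that a pair of FCMLA instructions (#0 then #90) performs. The sketch below is illustrative of the arithmetic only, not the exact architectural operand mapping:

```cpp
// Illustrative decomposition of acc += a * b into the two rotation
// steps an FCMLA #0 / FCMLA #90 pair performs (a sketch, not the
// exact instruction operand order).
struct Cplx { double re, im; };

Cplx cmla(Cplx acc, Cplx a, Cplx b) {
    // rotation #0: accumulate a.re * b
    acc.re += a.re * b.re;
    acc.im += a.re * b.im;
    // rotation #90: accumulate a.im * (i * b)
    acc.re -= a.im * b.im;
    acc.im += a.im * b.re;
    return acc;
}
```

Applying both rotations yields the full complex product, which is why the compiler emits the instructions in #0/#90 pairs.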
Take the following vector reduction loop, which multiplies elements of two arrays of complex numbers v and w and accumulates the result in the complex variable x.
Example 1: Reduction loop on array of complex types
#include <complex>
using namespace std;

complex<double> foo(complex<double> v[LEN], complex<double> w[LEN]) {
  complex<double> x;
  for (int i = 0; i < LEN; ++i)
    x += v[i]*w[i];
  return x;
}
Compiling this example with armclang at -Ofast, targeting a system such as Neoverse V1 that supports SVE and the Complex number extension, gives the first output below. Previous versions of ACfL would emit the second sequence.
.LBB0_1:
	add	x14, x0, x11
	add	x15, x1, x11
	ld1d	{ z0.d }, p0/z, [x14]
	ld1d	{ z3.d }, p0/z, [x14, #1, mul vl]
	ld1d	{ z4.d }, p0/z, [x15]
	ld1d	{ z5.d }, p0/z, [x15, #1, mul vl]
	add	x11, x11, x13
	subs	x12, x12, x10
	fcmla	z2.d, p0/m, z4.d, z0.d, #0
	fcmla	z1.d, p0/m, z5.d, z3.d, #0
	fcmla	z2.d, p0/m, z4.d, z0.d, #90
	fcmla	z1.d, p0/m, z5.d, z3.d, #90
	b.ne	.LBB0_1
// %bb.2:
	uzp1	z0.d, z2.d, z1.d
	uzp2	z1.d, z2.d, z1.d
	faddv	d0, p0, z0.d
	faddv	d1, p0, z1.d
.LBB0_1:
	add	x14, x0, x10
	add	x15, x1, x10
	add	x10, x10, x13
	subs	x12, x12, x9
	ld2d	{ z2.d, z3.d }, p0/z, [x14]
	ld2d	{ z4.d, z5.d }, p0/z, [x15]
	fmla	z0.d, p0/m, z4.d, z2.d
	fmla	z1.d, p0/m, z4.d, z3.d
	fmls	z0.d, p0/m, z5.d, z3.d
	fmla	z1.d, p0/m, z5.d, z2.d
	b.ne	.LBB0_1
// %bb.2:
	faddv	d0, p0, z0.d
	faddv	d1, p0, z1.d
Although the second instruction sequence is shorter (two ld2d instructions rather than four ld1d, and no uzp instructions), the use of fcmla brings a significant benefit over the four fmla instructions. Benchmarking with micro-kernels containing various complex expressions shows a 10-20% speedup on Neoverse V1 systems.
When the system supports it, ACfL 23.04 generates SVE FCMLA/FCADD instructions at -Ofast for loops containing any combination of additions and multiplications on complex number variables. These can be the types provided by C and C++ in <complex.h> or <complex>, or a custom-written type, for example a struct of two doubles. Expressions using std::conj are supported, as are reduction loops using fadd, as in our example above. For Fortran code, COMPLEX types work similarly. The instructions can also be emitted in vectorized loops containing control flow.
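For instance, a conjugate-multiply reduction, a common correlation-style kernel, is eligible for this vectorization. The kernel below is an illustrative sketch, not code from the release:

```cpp
#include <complex>

// Hypothetical correlation-style reduction using std::conj; at -Ofast on
// an SVE system with the Complex number extension, ACfL can vectorize
// loops of this shape with FCMLA/FCADD instructions.
std::complex<double> dotc(const std::complex<double>* v,
                          const std::complex<double>* w, int n) {
    std::complex<double> acc{0.0, 0.0};
    for (int i = 0; i < n; ++i)
        acc += std::conj(v[i]) * w[i];
    return acc;
}
```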
Arm Compiler for Linux 23.04 brings performance improvements on a number of workloads. The two graphs below show performance of SPEC2k17 and the RAJA Performance Suite, both benchmarked on Neoverse V1 systems. Results are compared to our previous ACfL release.
ACfL 23.04 shows a 5% geomean improvement over ACfL 22.1 on SPEC2k17 Rate, with double-digit percentage improvements in bwaves and mcf. The RAJA Performance Suite is a companion project to the RAJA C++ performance portability abstraction library, designed to explore the performance of loop-based computational kernels found in HPC applications. The suite comprises many small kernels that can be quite sensitive to changes in compiler code generation, so the graph shows only the spread of improvements and regressions greater than 5%. Overall, ACfL 23.04 shows a 38% geomean improvement over 22.1.
The final graph shows ACfL performance over a number of industry standard applications running over 64 cores on an AWS c7g.16xlarge. The data show ACfL 23.04 delivers many good improvements over ACfL 22.1.
The Arm Performance Libraries 23.04 release sees the introduction of a new set of routines for sparse linear algebra. The new functionality is available in both C and Fortran along with examples and full documentation. A separate blog has been produced to introduce the new set, which includes:
These routines are in addition to the existing SpMV and SpMM functionality described previously.
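To ground the terminology, SpMV is the sparse matrix-vector product y = A*x, typically stored in a compressed format such as CSR. A minimal serial reference, illustrative only and not the Arm PL interface, looks like:

```cpp
#include <vector>

// Minimal CSR sparse matrix-vector product y = A*x. Function and
// parameter names are illustrative, not the Arm PL API: row_ptr holds
// the start index of each row's nonzeros, col_idx their columns, and
// vals their values.
std::vector<double> spmv_csr(const std::vector<int>& row_ptr,
                             const std::vector<int>& col_idx,
                             const std::vector<double>& vals,
                             const std::vector<double>& x) {
    std::vector<double> y(row_ptr.size() - 1, 0.0);
    for (std::size_t r = 0; r + 1 < row_ptr.size(); ++r)
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            y[r] += vals[k] * x[col_idx[k]];
    return y;
}
```

Optimized library implementations parallelize and vectorize this pattern while coping with the irregular memory access that `col_idx` induces.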
Arm PL has been tuned to give good performance on AWS's HPC-focused Graviton 3E instances across a range of problem sizes and thread counts. The benchmark results below left were gathered using OMP_NUM_THREADS=64, and compared against current versions of OpenBLAS and BLIS. For larger problems Arm PL achieves up to 80% of theoretical peak performance of a whole 64-core instance. For smaller problems Arm PL will throttle back to use fewer threads if necessary to give better performance. This kind of tuning has been applied to all of the BLAS functions in Arm PL. The graph on the right shows the scaling of Arm PL, BLIS and OpenBLAS for a single problem size with increasing numbers of threads.
ArmPL: 23.04 build compatible with ACfL
OpenBLAS: 2158dc built using "TARGET=NEOVERSEV1 USE_OPENMP=1 NUM_THREADS=64" with gcc 12.2
BLIS: 60f363 built using "--enable-cblas -t openmp armsve" with gcc 12.2
The 23.04 release of Arm PL includes significant performance improvements for most of the BLAS routines and some of the key LAPACK routines. The development team tracks the performance of a wide set of BLAS routines and a selection of important LAPACK routines, for small O(10), medium O(100) and large O(1000) problem sizes, in both serial (1 thread) and parallel (8 threads) configurations. The results below, incorporating over 5000 individual benchmark cases, demonstrate how performance has generally improved since the last major release as a consequence of improved implementations and additional tuning work. In addition, results are shown comparing Arm PL 23.04 against the latest OpenBLAS and BLIS performance on the same cases.
ArmPL: 23.04 build compatible with ACfL
OpenBLAS: 8d6813 built using "TARGET=NEOVERSEV1 USE_OPENMP=1 NUM_THREADS=64" with gcc 12.2
BLIS: 60f363 built using "--enable-cblas -t openmp armsve" with gcc 12.2
FFT performance has improved in the 23.04 release across all transform lengths. Arm PL now provides some very significant performance gains over FFTW 3.3.10 on Graviton 3, especially for single-precision complex-to-complex transforms. Results for 1-dimensional transforms up to N=1024 are given below, where points above the orange line indicate cases that execute faster with Arm PL, with a mean speedup of 2.4x.
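For reference, the 1-D complex-to-complex transform being benchmarked computes out[k] = sum over j of in[j]*exp(-2*pi*i*j*k/N). A naive O(N^2) sketch of that definition follows; FFT libraries such as Arm PL and FFTW compute the same result in O(N log N):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Naive O(N^2) reference for the 1-D complex-to-complex DFT, written
// directly from the definition. Real FFT libraries produce the same
// result far faster via divide-and-conquer.
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& in) {
    const double pi = 3.14159265358979323846;
    const std::size_t n = in.size();
    std::vector<std::complex<double>> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j) {
            const double ang = -2.0 * pi * double(k * j) / double(n);
            out[k] += in[j] * std::complex<double>(std::cos(ang), std::sin(ang));
        }
    return out;
}
```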
ArmPL: 23.04 build compatible with ACfL
FFTW: 3.3.10 built with "--enable-neon --enable-fma --enable-single --enable-shared --enable-openmp" with gcc 12.2
Libamath is the part of Arm Performance Libraries which provides optimized scalar and vector implementations of basic mathematical functions, as found in math.h. The functions in libamath are used by Arm Compiler for Linux whenever possible; the compiler automatically links to the libamath library and can generate vectorized calls to these functions. For information on how to call vector routines directly see our article on using Arm vector math routines.
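As an illustration, a simple elementwise loop over a libm function is the shape the compiler can auto-vectorize into libamath vector calls. The loop below is a generic sketch, not tied to any particular routine in the library:

```cpp
#include <cmath>
#include <vector>

// Elementwise exp over an array: when built with armclang at a suitable
// optimization level, loops of this shape can be vectorized into calls
// to libamath's vector exp routine rather than scalar libm calls.
std::vector<double> exp_all(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = std::exp(in[i]);
    return out;
}
```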
Much emphasis has been put on improving Neon performance of math routines in the 23.04 release. As shown in the graphs below, a broad range of Neon routines has been optimized to give better performance while maintaining accuracy within 4 ULPs. This includes the single and double-precision regular and inverse hyperbolic routines, the inverse trigonometric routines, expm1, log1p, exp2, log2, and cbrt. The results shown here were generated on an AWS Graviton 2 and are displayed as speedup (higher is better) against the 22.1 release.
A few SVE routines were also improved, most notably SVE log2 (3x speedup), log2f (4x) and exp2f (1.7x), where speedup was evaluated on an AWS Graviton 3.
Shorter routines like trunc(f), fabs(f), modff, and rint have also improved by factors of around 1.5x over the 22.1 release.
[CTAToken URL = "https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog" target="_blank" text="Explore HPC Blogs" class ="green"]