Arm Compiler for Linux (ACfL) provides a complete compiling environment for natively developing and tuning your server and HPC applications on Arm-based platforms. It includes the Arm C/C++ Compiler, the Arm Fortran Compiler, and Arm Performance Libraries.
In this blog, we explore what is new in this first major release of 2023.
The Arm Compiler for Linux roadmap is changing to two major releases of ACfL per year, each with an update to the base LLVM version. This cadence ensures that ACfL keeps up with state-of-the-art LLVM technology, especially for AArch64 and SVE-optimized code generation, and allows users to access the full performance of their Arm-based systems sooner. To signify this change, ACfL releases are now numbered as YY.MM, denoting the year and month of release. This means 23.04 is out now, and 23.10 will be the next release in October, based on LLVM 17. Arm Performance Libraries numbering also follows this convention.
A summary of the key features in this release:
Full release notes can be found on developer.arm.com.
To support complex arithmetic, Arm introduced Armv8.3-A instructions for complex number addition and multiply-accumulate operations. ACfL uses these instructions as a vectorization target at -Ofast when the target system supports them.
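To see the math these instructions implement, a complex multiply-accumulate can be split into the two "rotation" steps that a pair of FCMLA instructions (#0 then #90) performs. The sketch below is illustrative of the arithmetic only, not the exact architectural operand mapping:

```cpp
// Illustrative decomposition of acc += a * b into the two rotation
// steps an FCMLA #0 / FCMLA #90 pair performs (a sketch, not the
// exact instruction operand order).
struct Cplx { double re, im; };

Cplx cmla(Cplx acc, Cplx a, Cplx b) {
    // rotation #0: accumulate a.re * b
    acc.re += a.re * b.re;
    acc.im += a.re * b.im;
    // rotation #90: accumulate a.im * (i * b)
    acc.re -= a.im * b.im;
    acc.im += a.im * b.re;
    return acc;
}
```

Applying both rotations yields the full complex product, which is why the compiler emits the instructions in #0/#90 pairs.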
Take the following vector reduction loop, which multiplies elements of two arrays of complex numbers v and w and accumulates the result in the complex variable x.
Example 1: Reduction loop on array of complex types
#include <complex>
using namespace std;

complex<double> foo(complex<double> v[LEN], complex<double> w[LEN]) {
  complex<double> x;
  for (int i = 0; i < LEN; ++i)
    x += v[i]*w[i];
  return x;
}
Compiling this example with armclang at -Ofast, targeting a system such as Neoverse V1 that supports SVE and the Complex number extension, gives the first output below. Previous versions of ACfL would emit the second sequence.
.LBB0_1:
	add	x14, x0, x11
	add	x15, x1, x11
	ld1d	{ z0.d }, p0/z, [x14]
	ld1d	{ z3.d }, p0/z, [x14, #1, mul vl]
	ld1d	{ z4.d }, p0/z, [x15]
	ld1d	{ z5.d }, p0/z, [x15, #1, mul vl]
	add	x11, x11, x13
	subs	x12, x12, x10
	fcmla	z2.d, p0/m, z4.d, z0.d, #0
	fcmla	z1.d, p0/m, z5.d, z3.d, #0
	fcmla	z2.d, p0/m, z4.d, z0.d, #90
	fcmla	z1.d, p0/m, z5.d, z3.d, #90
	b.ne	.LBB0_1
// %bb.2:
	uzp1	z0.d, z2.d, z1.d
	uzp2	z1.d, z2.d, z1.d
	faddv	d0, p0, z0.d
	faddv	d1, p0, z1.d
.LBB0_1:
	add	x14, x0, x10
	add	x15, x1, x10
	add	x10, x10, x13
	subs	x12, x12, x9
	ld2d	{ z2.d, z3.d }, p0/z, [x14]
	ld2d	{ z4.d, z5.d }, p0/z, [x15]
	fmla	z0.d, p0/m, z4.d, z2.d
	fmla	z1.d, p0/m, z4.d, z3.d
	fmls	z0.d, p0/m, z5.d, z3.d
	fmla	z1.d, p0/m, z5.d, z2.d
	b.ne	.LBB0_1
// %bb.2:
	faddv	d0, p0, z0.d
	faddv	d1, p0, z1.d
Although the second instruction sequence is shorter (two ld2d instructions rather than four ld1d, and no uzp instructions), the use of fcmla brings a significant benefit over the four fmla instructions. Benchmarking with micro-kernels containing various complex expressions shows a 10-20% speedup on Neoverse V1 systems.
When the system supports it, ACfL 23.04 generates SVE FCMLA/FCADD instructions at -Ofast for loops containing any combination of additions and multiplications on complex number variables. These can be the types provided by C and C++ in <complex.h> or <complex>, or a custom-written type, for example a struct of two doubles. Expressions using std::conj are supported, as are reduction loops using fadd, as in our example above. For Fortran code, COMPLEX types work similarly. The instructions can also be emitted in vectorized loops containing control flow.
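For instance, a conjugate-multiply reduction, a common correlation-style kernel, is eligible for this vectorization. The kernel below is an illustrative sketch, not code from the release:

```cpp
#include <complex>

// Hypothetical correlation-style reduction using std::conj; at -Ofast on
// an SVE system with the Complex number extension, ACfL can vectorize
// loops of this shape with FCMLA/FCADD instructions.
std::complex<double> dotc(const std::complex<double>* v,
                          const std::complex<double>* w, int n) {
    std::complex<double> acc{0.0, 0.0};
    for (int i = 0; i < n; ++i)
        acc += std::conj(v[i]) * w[i];
    return acc;
}
```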
Arm Compiler for Linux 23.04 brings performance improvements on a number of workloads. The two graphs below show performance of SPEC2k17 and the RAJA Performance Suite, both benchmarked on Neoverse V1 systems. Results are compared to our previous ACfL release.
ACfL 23.04 shows a 5% geomean improvement over ACfL 22.1 on SPEC2k17 Rate, with double-digit percentage improvements in bwaves and mcf. The RAJA Performance Suite is a companion project to the RAJA C++ performance portability abstraction library, designed to explore the performance of loop-based computational kernels found in HPC applications. The suite comprises many small kernels that can be quite sensitive to changes in compiler code generation, so the graph shows only the spread of improvements and regressions greater than 5%. Overall, ACfL 23.04 shows a 38% geomean improvement over 22.1.
The final graph shows ACfL performance over a number of industry standard applications running over 64 cores on an AWS c7g.16xlarge. The data show ACfL 23.04 delivers many good improvements over ACfL 22.1.
The Arm Performance Libraries 23.04 release sees the introduction of a new set of routines for sparse linear algebra. The new functionality is available in both C and Fortran along with examples and full documentation. A separate blog has been produced to introduce the new set, which includes:
These routines are in addition to the existing SpMV and SpMM functionality described previously.
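To ground the terminology, SpMV is the sparse matrix-vector product y = A*x, typically stored in a compressed format such as CSR. A minimal serial reference, illustrative only and not the Arm PL interface, looks like:

```cpp
#include <vector>

// Minimal CSR sparse matrix-vector product y = A*x. Function and
// parameter names are illustrative, not the Arm PL API: row_ptr holds
// the start index of each row's nonzeros, col_idx their columns, and
// vals their values.
std::vector<double> spmv_csr(const std::vector<int>& row_ptr,
                             const std::vector<int>& col_idx,
                             const std::vector<double>& vals,
                             const std::vector<double>& x) {
    std::vector<double> y(row_ptr.size() - 1, 0.0);
    for (std::size_t r = 0; r + 1 < row_ptr.size(); ++r)
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            y[r] += vals[k] * x[col_idx[k]];
    return y;
}
```

Optimized library implementations parallelize and vectorize this pattern while coping with the irregular memory access that `col_idx` induces.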
Arm PL has been tuned to give good performance on AWS's HPC-focused Graviton 3E instances across a range of problem sizes and thread counts. The benchmark results below left were gathered using OMP_NUM_THREADS=64, and compared against current versions of OpenBLAS and BLIS. For larger problems Arm PL achieves up to 80% of theoretical peak performance of a whole 64-core instance. For smaller problems Arm PL will throttle back to use fewer threads if necessary to give better performance. This kind of tuning has been applied to all of the BLAS functions in Arm PL. The graph on the right shows the scaling of Arm PL, BLIS and OpenBLAS for a single problem size with increasing numbers of threads.
ArmPL: 23.04 build compatible with ACfL
OpenBLAS: 2158dc built using "TARGET=NEOVERSEV1 USE_OPENMP=1 NUM_THREADS=64" with gcc 12.2
BLIS: 60f363 built using "--enable-cblas -t openmp armsve" with gcc 12.2
The 23.04 release of Arm PL includes significant performance improvements for most of the BLAS routines and some of the key LAPACK routines. The development team tracks the performance of a wide set of BLAS routines and a selection of important LAPACK routines, for small O(10), medium O(100) and large O(1000) problem sizes, in both serial (1 thread) and parallel (8 threads) configurations. The results below, incorporating over 5000 individual benchmark cases, demonstrate how performance has generally improved since the last major release as a consequence of improved implementations and additional tuning work. In addition, results are shown comparing Arm PL 23.04 against the latest OpenBLAS and BLIS performance on the same cases.
ArmPL: 23.04 build compatible with ACfL
OpenBLAS: 8d6813 built using "TARGET=NEOVERSEV1 USE_OPENMP=1 NUM_THREADS=64" with gcc 12.2
BLIS: 60f363 built using "--enable-cblas -t openmp armsve" with gcc 12.2
FFT performance has improved in the 23.04 release across all transform lengths. Arm PL now provides some very significant performance gains over FFTW 3.3.10 on Graviton 3, especially for single-precision complex-to-complex transforms. Results for 1-dimensional transforms up to N=1024 are given below, where points above the orange line indicate cases that execute faster with Arm PL, with a mean speedup of 2.4x.
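For reference, the 1-D complex-to-complex transform being benchmarked computes out[k] = sum over j of in[j]*exp(-2*pi*i*j*k/N). A naive O(N^2) sketch of that definition follows; FFT libraries such as Arm PL and FFTW compute the same result in O(N log N):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Naive O(N^2) reference for the 1-D complex-to-complex DFT, written
// directly from the definition. Real FFT libraries produce the same
// result far faster via divide-and-conquer.
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& in) {
    const double pi = 3.14159265358979323846;
    const std::size_t n = in.size();
    std::vector<std::complex<double>> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j) {
            const double ang = -2.0 * pi * double(k * j) / double(n);
            out[k] += in[j] * std::complex<double>(std::cos(ang), std::sin(ang));
        }
    return out;
}
```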
ArmPL: 23.04 build compatible with ACfL
FFTW: 3.3.10 built with "--enable-neon --enable-fma --enable-single --enable-shared --enable-openmp" with gcc 12.2
Libamath is the part of Arm Performance Libraries which provides optimized scalar and vector implementations of basic mathematical functions, as found in math.h. The functions in libamath are used by Arm Compiler for Linux whenever possible; the compiler automatically links to the libamath library and can generate vectorized calls to these functions. For information on how to call vector routines directly see our article on using Arm vector math routines.
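As an illustration, a simple elementwise loop over a libm function is the shape the compiler can auto-vectorize into libamath vector calls. The loop below is a generic sketch, not tied to any particular routine in the library:

```cpp
#include <cmath>
#include <vector>

// Elementwise exp over an array: when built with armclang at a suitable
// optimization level, loops of this shape can be vectorized into calls
// to libamath's vector exp routine rather than scalar libm calls.
std::vector<double> exp_all(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = std::exp(in[i]);
    return out;
}
```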
Much emphasis has been put on improving Neon performance of math routines in the 23.04 release. As shown in the graphs below, a broad range of Neon routines has been optimized to give better performance while maintaining accuracy within 4 ULPs. This includes the single and double-precision regular and inverse hyperbolic routines, the inverse trigonometric routines, expm1, log1p, exp2, log2, and cbrt. The results shown here were generated on an AWS Graviton 2 and are displayed as speedup (higher is better) against the 22.1 release.
A few SVE routines were also improved, most notably SVE log2 (3x speedup), log2f (4x) and exp2f (1.7x), where speedup was evaluated on an AWS Graviton 3.
Shorter routines like trunc(f), fabs(f), modff, and rint have also improved by factors of around 1.5x over the 22.1 release.
[CTAToken URL = "https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog" target="_blank" text="Explore HPC Blogs" class ="green"]