This blog highlights new features and performance improvements that come with Arm Allinea Studio 21.1. Arm Allinea Studio (AAS) is a combination of Forge (DDT, MAP, and Performance Reports), Arm Compiler for Linux (ACfL), and Arm Performance Libraries (ArmPL).
Arm Compiler for Linux is our "vendor compiler" package intended for HPC and Cloud workloads. It includes C, C++ and Fortran compilers, as well as Arm Performance Libraries.
This release of ACfL includes an upgrade from LLVM version 11 to version 12. LLVM itself is constantly improving, so we would expect a general improvement in functionality, performance, and stability from version 11 to version 12. With compilers it is never quite that simple, though: the demands placed on a compiler, across a nearly infinite combination of possible inputs, are often in conflict with one another. For example: compile my code quickly, make sure it is correct, and make sure it runs fast in all situations. Internal benchmarks of LLVM 12-based ACfL show a 1-2% improvement across several industry-standard benchmarks, along with a couple of minor performance regressions, but overall a positive result.
Although Arm Compiler for Linux has well-established Scalable Vector Extension (SVE) support for both Arm C Language Extensions (ACLE) code and autovectorized code, both use the Vector Length Agnostic (VLA) paradigm that SVE naturally allows. VLA SVE code can be compiled once and will vectorize well on any SVE implementation it runs on. The Vector Length Specific (VLS) SVE paradigm, which targets SVE instructions for a fixed vector width specified at compile time, is new to Arm Compiler for Linux. This type of SVE code is intended for cases where a fixed vector width is intrinsic to the code, perhaps even required by the algorithm, or where the code is heavily optimized for fixed-width vectors. Where the user does not mind recompiling for each new vector width they encounter, VLS SVE can be preferable to VLA. In some cases it also offers an easier migration path from (inherently fixed-width) Neon targets.
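For contrast, the following is a minimal vector-length-agnostic ACLE sketch (an illustrative example, not taken from the release): the loop queries the hardware vector length at run time, so one binary runs correctly at any SVE width.

    #include <arm_sve.h>
    #include <stdint.h>

    /* VLA SVE: svcntd() and svwhilelt_b64() adapt to the hardware vector
       length, so the same binary is correct at any SVE width. */
    void vla_add(int64_t n, const double *a, const double *b, double *c) {
      for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);  /* predicate also covers the tail */
        svfloat64_t va = svld1_f64(pg, &a[i]);
        svfloat64_t vb = svld1_f64(pg, &b[i]);
        svst1_f64(pg, &c[i], svadd_f64_z(pg, va, vb));
      }
    }

The VLS example later in this post removes that run-time flexibility in exchange for compile-time knowledge of the vector width.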
VLS SVE ACLE support was completed in LLVM 12, so ACfL inherits this feature as a function of the upstream merge. The feature introduces an ACLE feature macro, __ARM_FEATURE_SVE_BITS, and a type attribute, arm_sve_vector_bits, that can be used to specialize the normal SVE ACLE data types to a specific vector width. The value of this width is set by an equivalent compiler command-line option, -msve-vector-bits=<number>. The user can then use these types in normal SVE ACLE code while making assumptions about the vector width.
The following example assumes a specific vector width, and therefore how many double-precision elements each svfloat64_t holds.
    #include <arm_sve.h>

    #if __ARM_FEATURE_SVE_BITS == 256
    typedef svfloat64_t vec_float_t __attribute__((arm_sve_vector_bits(256)));
    #else
    #error Only -msve-vector-bits=256 is supported
    #endif

    void vls_slp(double a1, double a2, double a3, double a4,
                 double b1, double b2, double b3, double b4,
                 double *A) {
      svbool_t pg = svptrue_b64();
      double a[4] = {a1, a2, a3, a4};
      vec_float_t vec_a = svld1_f64(pg, &a[0]);
      double b[4] = {b1, b2, b3, b4};
      vec_float_t vec_b = svld1_f64(pg, &b[0]);
      vec_float_t add_res = svadd_f64_z(pg, vec_a, vec_b);
      vec_float_t mul_res = svmul_f64_z(pg, vec_a, add_res);
      vec_float_t div_res = svdiv_f64_z(pg, mul_res, vec_b);
      svst1_f64(pg, A, div_res);
    }
It must be built with -msve-vector-bits=256 to produce the code below:
    vls_slp:
            sub     sp, sp, #0x40
            stp     d0, d1, [sp, #32]
            stp     d2, d3, [sp, #48]
            ptrue   p0.d
            add     x8, sp, #0x20
            ld1d    {z0.d}, p0/z, [x8]
            stp     d4, d5, [sp]
            stp     d6, d7, [sp, #16]
            mov     x8, sp
            ld1d    {z1.d}, p0/z, [x8]
            movprfx z2, z0
            fadd    z2.d, p0/m, z2.d, z1.d
            fmul    z0.d, p0/m, z0.d, z2.d
            fdiv    z0.d, p0/m, z0.d, z1.d
            st1d    {z0.d}, p0, [x0]
            add     sp, sp, #0x40
            ret
The generated code will not produce the intended result if the hardware SVE vector length is not 256 bits. If the vector length is shorter, the vector operations do not cover all four elements assumed by the original function. If the vector length is longer, the store might write to unallocated memory. The code is tied to a vector width of 256 bits.
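One defensive pattern for VLS builds (a sketch of our own, not something the compiler inserts for you) is to verify the hardware vector length at startup using the ACLE svcntb() call, which returns the vector length in bytes:

    #include <arm_sve.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Abort early if the hardware vector length does not match the
       256-bit width this VLS code was compiled for. */
    void check_vector_length(void) {
      uint64_t bits = svcntb() * 8;
      if (bits != 256) {
        fprintf(stderr, "error: built for 256-bit SVE, hardware is %lu-bit\n",
                (unsigned long)bits);
        exit(EXIT_FAILURE);
      }
    }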
Adding this feature to ACfL allows it to compile existing VLS SVE ACLE codes such as Grid and Gromacs. The latest version of Gromacs shows a 15% performance gain from switching from Neon to VLS SVE on A64FX. Meanwhile, upstream LLVM's support for VLS SVE vectorization is developing well and will appear in a future version of Arm Compiler for Linux.
A useful feature of the LLVM optimizer is its optimization remarks. These are diagnostic messages, enabled with the -Rpass family of options, that tell the user what the compiler's optimizer is doing with their code as it passes through the compiler pipeline. Most usefully, the user can enable remarks that point to where the compiler was unable to apply key optimizations, and explain why. This can help a motivated user hand-optimize their code to run as fast as possible on their machine by rewriting hot parts into a form the compiler can optimize.
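For example, the standard clang-style remark options, which armclang also accepts, select remarks by a regular expression over optimization pass names (loop-vectorize is used here purely for illustration):

    > armclang -O3 -Rpass=loop-vectorize test.c           # report optimizations that were applied
    > armclang -O3 -Rpass-missed=loop-vectorize test.c    # report optimizations that were missed
    > armclang -O3 -Rpass-analysis=loop-vectorize test.c  # report the analysis explaining why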
Arm Compiler for Linux has a number of downstream improvements to optimization remarks, including improved remark texts and additional remarks. Sometimes, additional analysis code is needed to emit a high-quality remark. Recently, Arm has contributed a number of remark improvements back to upstream LLVM to help all users of LLVM-based compilers better hand-optimize their codes. We have enhanced five remarks that warn users when loop-carried dependencies (that is, dependencies between two iterations of the same loop) are preventing vectorization.
For example, this code now generates the following remark:
    void test_backward_dep(int n, double *A) {
      for (int i = 1; i <= n - 3; i += 3) {
        A[i] = A[i-1];
        A[i+1] = A[i+3];
      }
    }

    > armclang -O3 -Rpass-missed=loop-accesses test.cpp
    > test.cpp:4:14: remark: loop not vectorized: Backward loop carried data dependence. Memory location is the same as accessed at line test.cpp:3:5
          A[i+1] = A[i+3];
Our modifications expand the remarks to be specific about the kind of dependency preventing optimization and make it refer to the source code instead of elements of the LLVM intermediate representation. The new phrasing also points to both memory accesses involved in the dependency, instead of just the access which happened to raise the remark.
Arm Performance Libraries is our "vendor" maths library solution. It is primarily deployed in HPC and cloud use-cases as the performant solution for vector and matrix computations, mostly on dense data. In addition, ArmPL provides solutions for sparse linear algebra, FFTs, and libm functions. It is available both as a free standalone product and as part of Arm Compiler for Linux in Arm Allinea Studio.
Since the last release we have continued to improve our implementations of the BLAS (Basic Linear Algebra Subprograms) functions, with a particular focus on how we handle small problems. Solving lots of small problems has become increasingly important for many applications. Sometimes this comes from more fine-grained parallelism at the application level; in other cases it is because the library is being used to tackle new types of workloads beyond traditional HPC (for example, in data science in the cloud, linking the Python packages numpy and scipy to Arm PL). Handling small problems effectively means cutting overheads, such as setting up multiple threads unnecessarily, and minimizing the amount of work on padded data we perform within our optimized kernels.
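As a simple illustration (Arm PL's threaded builds use OpenMP, and the application name below is a placeholder), a user calling the library on many tiny problems can also cap the thread count explicitly from the environment:

    > OMP_NUM_THREADS=1 ./small_problems_app    # avoid thread set-up overhead for tiny problems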
We use a unified framework within the library for developing dense linear algebra functions, which allows us to take a consistent approach across different functions. By improving the design and use of this framework we have begun to see gains for smaller problems, whilst at the same time laying the foundations for the introduction of new high-performance kernels, such as SVE-enabled BLAS functions for upcoming Arm cores. For example, in the 21.1 release we have restructured our level 3 BLAS matrix rank-update functions (?SYRK, ?SYR2K), which has led to some good performance improvements, shown in the following graph for ?SYRK. These benchmarks were run using a single core of an AWS Graviton2 (Neoverse N1) instance (c6g.2xlarge). The improved performance, especially for small cases, is compared here against the latest results for open-source alternatives.
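For reference, here is a minimal sketch of a small rank-k update through the CBLAS interface that Arm PL exposes via its armpl.h header (the sizes are chosen arbitrarily to represent the small cases discussed above):

    #include <armpl.h>   /* Arm PL header; provides the CBLAS interface */

    int main(void) {
      const int n = 8, k = 4;          /* deliberately small problem */
      double A[8 * 4], C[8 * 8];
      for (int i = 0; i < n * k; i++) A[i] = 1.0;
      for (int i = 0; i < n * n; i++) C[i] = 0.0;

      /* C := 1.0 * A * A^T + 0.0 * C, updating the upper triangle of C */
      cblas_dsyrk(CblasColMajor, CblasUpper, CblasNoTrans,
                  n, k, 1.0, A, n, 0.0, C, n);
      return 0;
    }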
In the 21.1 release, we have added SVE kernels for some of the interleave-batch functions we introduced in the 21.0 release. Interleave-batch functions efficiently process large numbers of small matrices, which is useful in many fields such as image signal processing, computational fluid dynamics, hydrodynamics, and deep learning. For more information about the design of these functions, see this blog on the topic. The changes here include some improvements to our Neon kernels and new SVE kernels for general matrix-matrix multiplication (armpl_dgemm_interleave_batch), triangular matrix-matrix multiplication (armpl_dtrmm_interleave_batch), and triangular matrix solve (armpl_dtrsm_interleave_batch). The SVE kernels for these functions perform best when the interleaving factor, ninter, is a multiple of 8 times the SVE vector length (although the functions work correctly with any value of ninter). For example, when running on A64FX with a vector length of 512 bits, we have a vector length of 8 elements for double-precision real data, which means that the recommended value of ninter is 64. For our Neon kernels, a value of ninter=16 generally produced the best performance.
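To illustrate the idea behind interleaving (a hypothetical indexing helper for exposition only, not Arm PL's documented storage format): keeping the (i, j) element of ninter matrices contiguous means that one vector instruction can update the same element of many matrices at once.

    #include <stddef.h>

    /* Hypothetical interleaved storage for a block of `ninter` matrices with
       `ncols` columns: the (i, j) element of all `ninter` matrices forms one
       contiguous run of `ninter` doubles, so a single vector load/store
       spans the same element of many matrices. */
    static inline double *interleaved_elem(double *block, int ninter, int ncols,
                                           int i, int j, int b /* 0 <= b < ninter */) {
      return &block[((size_t)i * ncols + j) * ninter + b];
    }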
The following graphs show the performance of our interleave-batch functions in Arm PL 21.1 when operating on a batch of 32,768 matrices for a selection of square matrix dimensions, using a single A64FX core. The results show the speedup gained over repeated calls into the equivalent BLAS functions, using Arm PL's SVE BLAS implementations. In each case, we adapted the interleave-batch layout to use the recommended interleaving factor: ninter=16 and nbatch=2048 for Neon, and ninter=64 and nbatch=512 for SVE. The DGEMM results show that the only case on A64FX where the interleave-batch approach is not worthwhile is for matrices of size 20. In all other cases the interleave-batch approach is multiple times faster than BLAS, and our new SVE kernels give significant performance improvements compared with the Neon equivalents.
In addition to working on performance improvements the Arm PL team has also been working to improve accessibility. There are a few different strands to this work.
First, we started the groundwork for a unified Arm PL, a project to build a single library that contains optimizations for both Neon and SVE cores. The 21.0 release of Arm PL moved from producing a separate library for each microarchitecture we tuned for to producing two: one for all Neon-only cores and one for SVE-capable cores. At 22.0, we hope to merge those two libraries into the unified Arm PL, containing Neon and SVE code (both hand-tuned and compiler-autovectorized) side by side. This will make the library package smaller and simpler to work with, and will enable ISVs to link against a single copy of the library that works optimally on any of the AArch64 cores we support.
Alongside this, we are also working to make the library portable to other platforms (not just Linux). Exploratory work is ongoing related to Windows and Mac support. Stay tuned.
Allowing Arm PL to be packaged in a Python wheel with Numpy and Scipy requires that the package can be installed on any Linux distribution. Our library builds are done separately for several popular Linux distributions (for example, see the downloads list for free Arm PL), and each build is tied to a particular compiler and its associated runtime libraries, including libgfortran and glibc. The Numpy developers pointed us to the manylinux standard for building Python packages portably, and we have successfully built a compliant serial version of Arm PL. As part of this work we also had to remove Arm PL's dependence on the Fortran runtime library (to avoid Python users having to download libgfortran separately), which we did by hiding the Fortran runtime objects we require within libarmpl. The result is a portable serial build of the library, ready to be packaged with Numpy and Scipy.
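A quick way to verify that property (standard Linux tooling, not an Arm PL-specific step) is to inspect the shared library's runtime dependencies:

    > ldd libarmpl.so | grep -i gfortran    # expect no output: the Fortran runtime is folded into libarmpl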
We have updated our tools for HPC application developers across platforms with the release of Arm Forge 21.1. Amongst other developments, the release provides enhanced application profiling through the visualization of GPU Memory Transfers.
GPU memory transfers can consume a lot of interconnect bandwidth. Users who are optimizing software for different GPUs can benefit from visualizing this traffic, particularly at large scale. Until now, Arm MAP displayed time spent waiting for accelerators but could not easily show how much of that time went to actual GPU processing versus memory transfers. The new memory transfer profiling feature in MAP helps users distinguish useful GPU compute from memory transfer overhead, giving a hint about when and where to optimize their software.
The new MAP feature even helps users distinguish between different types of transfers:
MAP can optionally track and display the stack trace and source code locations from which memory transfers were made.
Review the latest Arm Allinea Studio release notes here. Download the latest version of Arm Allinea Studio here.