This blog provides an overview of Arm Allinea Studio 21.0. Arm Allinea Studio is a tool suite consisting of the following components:

- Arm Compiler for Linux (the Arm C/C++ Compiler, the Arm Fortran Compiler, and Arm Performance Libraries)
- Arm Forge (the Arm DDT debugger, the Arm MAP profiler, and Arm Performance Reports)
Arm MAP takes advantage of the Arm Statistical Profiling Extension (SPE) to provide more precise performance profiles of HPC applications. Hardware performance monitors (HPM) have existed within CPUs for years; however, HPM data can be voluminous, and interrupt latency can result in substantial "skid": the recorded program location lies some distance beyond the point where the event actually happened, making it very difficult to associate performance events with the application code that caused them. Conversely, SPE packets are generated by the core itself at user-defined intervals, giving a precise application location for each event. The result is more detailed insight into when and where events are happening.
SPE is an optional Armv8-A architecture extension available since Armv8.2-A. Examples of infrastructure CPUs supporting SPE are the Neoverse E1, Neoverse N1, and Neoverse V1; AWS Graviton2 and Ampere Altra are examples of Neoverse-based processors. As Neoverse N1 and Neoverse V1 come to dominate the cloud, precise performance analysis is key to enabling developers of HPC applications to tune their software for maximum throughput, lowering runtime costs for users. Later this year we will start to see SVE-enabled silicon in the cloud, making a scalable profiling solution even more important for optimizing HPC workloads.
SPE support in MAP provides a statistical view of the performance characteristics of executed instructions and micro-operations, creating a performance profile correlated with the software, which developers can use to optimize HPC code. With MAP's support for SPE, developers can not only track the location and timing of cache misses, branch mispredicts, and TLB walks, but can now also profile micro-architectural details such as the number of cycles taken to execute an instruction or to translate its virtual address.
Figure 1: Example of how MAP utilizes the SPE to produce a profile of L1 data cache refill activity.
Arm Compiler for Linux is based on the open-source LLVM, Clang, and Flang projects. The 21.0 release upgrades this component from LLVM version 9 to LLVM version 11, bringing the benefits of 12 months of upstream development: improved correctness, stability, and performance. The following graph compares Arm Compiler for Linux 21.0 with the previous release, Arm Compiler for Linux 20.3, on the SPEC CPU 2017 Rate benchmark suites (integer and floating point). Most benchmarks show a modest improvement, with large improvements to 521.wrf and 538.imagick.
Figure 2: ACfL 21.0 demonstrates significant performance uplift for 521.wrf and 538.imagick over the previous ACfL version.
How users access optimized functions for different microarchitectures has also been improved. Instead of providing a separate library for users to link against for each target microarchitecture, we now provide just two libraries:

- a Neon library
- an SVE library
These two libraries are now capable of detecting which optimized functions to select at runtime. This will make life easier for users wanting to, for example, link to Arm Performance Libraries and achieve performance portability across ThunderX2 and Neoverse N1 cores. There is no need to link to separate libraries in this case; the same binary will work optimally on both types of cores.
There has been a lot of interest in batched linear algebra over the past few years, and a number of interfaces have emerged for solving large batches of problems in ways other than looping over BLAS and LAPACK routines. Arm Performance Libraries (Arm PL) already contains some support for batched matrix-matrix multiplication (for example, dgemm_batch). These functions allow users to provide pointers to each matrix involved and to define groups of different problem parameters (matrix sizes, transpose options). Such techniques can be useful in some situations, but they add complexity for users of the interfaces and require a very wide variety of potential problem types to be optimized within the library.
A very common use case for batched linear algebra is simply solving large batches of small problems which all have the same parameters (matrix sizes, transpose options). In this case, it is possible to gain significant performance improvements by using (Neon or SVE) vector instructions to operate on different problems in different vector lanes. This requires interleaving matrices in memory so that corresponding elements across ninter matrices can be accessed with contiguous loads. We have added such "interleave-batch" functions for a selection of key double-precision, real problems in the 21.0 release. The functions allow nbatch batches of ninter interleaved matrices (that is, nbatch * ninter matrices in total) to be processed efficiently, and workloads are parallelized with OpenMP.
The functionality we provide offers alternatives to repeated calls to BLAS functions, including:
We also provide utility functions for packing and unpacking matrices to and from the interleave-batch layout.
The interleave-batch functions are fully documented, with an introduction and an example. A separate blog post with more details on the design, usage, and performance of these functions is coming soon. Here, we show the performance improvements that can be achieved for QR factorization: the interleave-batch approach outperforms repeated calls to the equivalent LAPACK routine for all matrix dimensions up to 60.
Figure 3: Arm PL 21.0 – QR factorization with interleave-batch function compared with LAPACK
Real-to-real (r2r) transforms in the FFTW3 interface allow users to perform four kinds of Discrete Cosine Transform for even real sequences, four kinds of Discrete Sine Transform for odd real sequences, and the Discrete Hartley Transform (DHT). Real-to-real transforms also allow Fast Fourier Transforms using the halfcomplex format (in which the real and imaginary components are stored separately). In each case, we now provide fully tested implementations of the basic, advanced, and guru r2r interfaces, in C, Fortran 77, and Fortran 2003. These functions are fully documented in the Arm PL Reference Guide. As with other FFTW3 functions, users do not need to change their code compared with using FFTW3 itself. Our r2r implementations make use of the vectorized (Neon or SVE, depending on the library selected) FFT kernels, providing O(n log n) performance scaling. With the addition of these functions, Arm PL now provides a complete implementation of the transforms in FFTW3.
Returning to non-batched BLAS: in matrix-matrix multiplication where only the upper or lower triangular part of the output matrix is needed, significant performance improvements can be achieved by providing a dedicated routine rather than leaving users to arrange this themselves with a separate buffer. We introduce the BLAS extension routines ?GEMMT for this purpose, for all data types (single and double precision, real and complex). These functions are fully documented in the Arm PL Reference Guide, and each has been parallelized with OpenMP. Comparing a call to DGEMMT, which updates only the required part of the destination matrix, against a call to DGEMM that writes the output to a separate buffer which is then copied into the required destination, DGEMMT achieves around twice the performance.
Figure 4: DGEMMT compared to DGEMM plus copy
We have also worked on reducing overheads in BLAS routines, which helps improve performance for small problems. In level 1 BLAS we have faster ?DOT dot products, and in level 2 BLAS we have added new, faster, multithreaded versions of some banded routines: symmetric and Hermitian banded matrix-vector multiplication (?SBMV and ?HBMV), as well as triangular banded matrix-vector multiplication (?TBMV). In level 3 BLAS, there are improvements for rectangular SGEMM and DGEMM problems with small values of N and K. Key LAPACK routines for LU factorization (?GETRF) and Cholesky factorization (?POTRF) also benefit from reduced overheads, increasing performance for small problems.
Previous releases of Arm Compiler for Linux had a somewhat confusing module system, with many modules with similar, long names. The 21.0 release introduces a brand-new set of modulefiles which are more user-friendly and flexible. We have renamed the modules to make them easier to scan, and we now only show the modules that are relevant to the user at the time.
Arm Performance Libraries modules only become available once the appropriate compiler has been loaded, so they do not clutter up the module list until they are actually usable. The modules are also more flexible: the whole directory can be moved to a location of the user's choice without editing the files themselves.
Figure 5: Simplified module system in ACfL 21.0.
Finally, if the more modern Lmod is available on the system, the new modules take advantage of its "family" declaration. This prevents different compiler modules from being loaded at the same time and allows the library modules to be swapped dynamically when the compilers are swapped.
Arm Allinea Studio 21.0 is full of new features for optimizing HPC workloads on Arm, adding SPE profiling for fine-grained performance analysis along with compiler and library updates for improved performance of your HPC applications. Try it now.
Download and installation instructions for Arm Allinea Studio: https://developer.arm.com/tools-and-software/server-and-hpc/arm-allinea-studio/installation/single-page.