Arm Allinea Studio is a tool suite comprising the Arm Compiler for Linux, Arm Performance Libraries, and the Arm Forge tools, including Arm MAP. This blog provides an overview of the Arm Allinea Studio 21.0 release.
Arm MAP takes advantage of the Arm Statistical Profiling Extension (SPE) to provide more precise performance profiles of HPC applications. Hardware performance monitors (HPM) have existed within CPUs for years, but HPM data can be voluminous, and interrupt latency can result in substantial “skid”: the program stops beyond the point where the event happened, making it very difficult to associate performance events with the application code that caused them. By contrast, SPE packets are generated by the core itself at user-defined intervals, giving a precise application location for each event. The result is more detailed insight into when and where events are happening.
SPE is an optional Armv8-A architecture extension available since Armv8.2-A. Examples of infrastructure CPUs supporting SPE are the Neoverse E1, Neoverse N1, and Neoverse V1; AWS Graviton2 and Ampere Altra are examples of Neoverse-based processors. As Neoverse N1 and Neoverse V1 come to dominate the cloud, precise performance analysis capability is key to enabling developers of HPC applications to tune their software for maximum throughput, lowering runtime costs for users. Later this year we will start to see SVE-enabled silicon in the cloud, making a scalable profiling solution even more important for optimizing HPC workloads.
SPE support in MAP provides a statistical view of the performance characteristics of executed instructions and micro-operations, creating a performance profile correlated with the software that developers can use to optimize HPC code. With SPE support in MAP, developers can not only track the location and timing of cache misses, branch mispredictions, and TLB walks, but can also profile micro-architectural details such as the number of cycles taken to execute an instruction or to translate its virtual address.
Figure 1: Example of how MAP utilizes the SPE to produce a profile of L1 data cache refill activity.
Arm Compiler for Linux is based on open-source LLVM, Clang, and Flang technologies. The 21.0 release upgrades this component from LLVM version 9 to LLVM version 11, bringing the benefits of 12 months of upstream development and improving correctness, stability, and performance. The following graph compares Arm Compiler for Linux 21.0 with the previous Arm Compiler for Linux 20.3 on the SPEC CPU 2017 rate benchmarks (integer and floating point). Most benchmarks show a modest improvement, with large gains for 521.wrf and 538.imagick.
Figure 2: ACfL 21.0 demonstrates significant performance uplift for 521.wrf and 538.imagick over the previous ACfL version.
How users access optimized functions for different microarchitectures has also been improved. Instead of providing a separate library for users to link against for each target microarchitecture, we now provide just two libraries: one built for Neon-capable cores and one built for SVE-capable cores.
These two libraries detect at runtime which optimized functions to select. This makes life easier for users wanting to, for example, link to Arm Performance Libraries and achieve performance portability across ThunderX2 and Neoverse N1 cores: there is no need to link against separate libraries, and the same binary will work optimally on both types of core.
There has been a lot of interest in batched linear algebra over the past few years, and a number of interfaces have emerged for solving large batches of problems in alternative ways to looping over BLAS and LAPACK routines. Arm Performance Libraries (Arm PL) already contains some support for batched matrix-matrix multiplication (for example, dgemm_batch). These functions allow users to provide pointers to each matrix involved and to define groups of different problem parameters (matrix sizes, transpose options). This approach can be useful in some situations, but it adds complexity for users of the interfaces and requires a very wide variety of potential problem types to be optimized within the library.
A very common use case for batched linear algebra is simply solving large batches of small problems, all with the same parameters (matrix sizes, transpose options). In this case, significant performance improvements are possible by using Neon or SVE vector instructions to operate on different problems in different vector lanes. This requires interleaving matrices in memory so that corresponding matrix elements across ninter matrices can be accessed using contiguous loads. We have added such "interleave-batch" functions for a selection of key double-precision, real problems in the 21.0 release. The functions allow nbatch batches of ninter interleaved matrices (that is, nbatch * ninter matrices in total) to be processed efficiently, and workloads are parallelized with OpenMP.
The functionality we provide offers alternatives to repeated calls to BLAS and LAPACK functions; alternatives include interleave-batch versions of operations such as matrix-matrix multiplication and the QR factorization shown in Figure 3.
We also provide utility functions for packing and unpacking matrices to and from the interleave-batch layout.
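To make the interleaved layout concrete, the following C sketch shows one possible arrangement, in which element (i, j) of each of the ninter matrices in a block is stored contiguously. The indexing convention and the pack_interleaved helper are illustrative assumptions, not the Arm PL API; the actual layout and the packing utilities are described in the Arm PL Reference Guide.

```c
#include <stddef.h>

/* Illustrative interleave-batch layout (an assumption for this sketch, not
 * the Arm PL API): element (i, j) of matrix p, 0 <= p < ninter, is stored so
 * that the same (i, j) element of all ninter matrices is contiguous, letting
 * a vector load place one matrix in each vector lane. */
static inline size_t interleaved_index(size_t i, size_t j, size_t p,
                                       size_t n, size_t ninter)
{
    return (i * n + j) * ninter + p;
}

/* Pack ninter separate row-major m-by-n matrices into one interleaved block
 * (Arm PL ships its own packing/unpacking utilities for this purpose). */
void pack_interleaved(const double *const *src, double *dst,
                      size_t m, size_t n, size_t ninter)
{
    for (size_t p = 0; p < ninter; ++p)
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j)
                dst[interleaved_index(i, j, p, n, ninter)] = src[p][i * n + j];
}
```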
The interleave-batch functions are fully documented, with an introduction and a worked example. For a more detailed view of how we use interleave-batch functions to efficiently solve large numbers of small linear algebra problems in a way that is optimized for the Arm architecture, see this blog. Below, we show the performance improvements that can be achieved for QR factorization, which outperforms repeated calls to the equivalent LAPACK routine for all problem sizes up to dimension 60.
Figure 3: Arm PL 21.0 – QR factorization with interleave-batch function compared with LAPACK
Real-to-real (r2r) transforms in the FFTW3 interface allow users to perform four kinds of Discrete Cosine Transform for even real sequences, four kinds of Discrete Sine Transform for odd real sequences and the Discrete Hartley Transform (DHT). Real-to-real transforms also allow Fast Fourier Transforms using the Half Complex format (in which the real and imaginary components are provided separately). In each case, we now provide fully tested implementations for the basic, advanced and guru r2r interfaces, for C, Fortran 77 and Fortran 2003 interfaces. These functions are fully documented in the Arm PL Reference Guide. As with other FFTW3 functions, users do not need to change their code compared with using FFTW3 itself. Our r2r implementations make use of the vectorized (using Neon or SVE, depending on library selected) FFT kernels, providing O(n log n) performance scaling. The addition of these functions to Arm PL means that we now provide a complete implementation of the transforms in FFTW3.
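As an illustration, the minimal C sketch below plans and executes a DCT-II (r2r kind FFTW_REDFT10) through the standard FFTW3 interface. Because Arm PL implements the FFTW3 interface, the same source can be linked against Arm PL unchanged; the problem size and input data here are purely illustrative.

```c
#include <stdio.h>
#include <fftw3.h>

int main(void)
{
    const int n = 8;
    double *in  = fftw_malloc(sizeof(double) * n);
    double *out = fftw_malloc(sizeof(double) * n);

    /* Plan a 1-D real-to-real transform: DCT-II of an even real sequence. */
    fftw_plan plan = fftw_plan_r2r_1d(n, in, out, FFTW_REDFT10, FFTW_ESTIMATE);

    /* Fill the input after planning (FFTW_ESTIMATE does not touch the arrays). */
    for (int i = 0; i < n; i++)
        in[i] = (double)i;

    fftw_execute(plan);

    for (int i = 0; i < n; i++)
        printf("out[%d] = %f\n", i, out[i]);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```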
Returning to non-batched BLAS: for matrix-matrix multiplication where only the upper or lower triangular part of the output matrix is required, significant performance improvements can be achieved by providing a dedicated routine rather than leaving users to arrange this with a separate buffer. We have introduced the BLAS extension routines ?GEMMT for this purpose, for all data types (single and double precision, real and complex). These functions are fully documented in the Arm PL Reference Guide, and each has been parallelized with OpenMP. Comparing a single DGEMMT call, which updates only the required part of the destination matrix, against calling DGEMM with a separate buffer for the output matrix that is then copied into the required destination, DGEMMT delivers around twice the performance.
Figure 4: DGEMMT compared to DGEMM plus copy
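To make the comparison in Figure 4 concrete, the sketch below implements the DGEMM-plus-copy baseline using the standard CBLAS interface; the single DGEMMT call that replaces it writes only the requested triangle of the destination directly, following the interface documented in the Arm PL Reference Guide. The row-major layout and problem shapes are illustrative.

```c
#include <stdlib.h>
#include <cblas.h>

/* Baseline that Figure 4 compares against: form the full n-by-n product with
 * DGEMM into a temporary buffer, then copy only the lower triangle into C.
 * The upper triangle of tmp is computed and then discarded, which is exactly
 * the work a single DGEMMT call avoids. */
void lower_update_via_dgemm(int n, int k,
                            const double *A, const double *B, double *C)
{
    double *tmp = malloc((size_t)n * n * sizeof(double));

    /* tmp = A * B, with A n-by-k and B k-by-n (row-major). */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, k, 1.0, A, k, B, n, 0.0, tmp, n);

    /* Keep only the lower triangle. */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)
            C[i * n + j] = tmp[i * n + j];

    free(tmp);
}
```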
We have also worked on reducing overheads in BLAS routines, which improves performance for small problems. In level 1 BLAS we have faster ?DOT dot products, and in level 2 BLAS we have added new, faster, multithreaded versions of some banded routines: symmetric and Hermitian matrix-vector multiplication (?SBMV and ?HBMV), as well as triangular matrix-vector multiplication (?TBMV). For level 3 BLAS, there are improvements for rectangular SGEMM and DGEMM problems with small values of N and K. Key LAPACK routines for LU factorization (?GETRF) and Cholesky factorization (?POTRF) also benefit from reduced overheads and better performance for small problems.
In previous releases of Arm Compiler for Linux, we have had a somewhat confusing module system, with many modules with similar, long names. The 21.0 release introduces a brand-new set of modulefiles which are more user-friendly and flexible. We have renamed the modules to make them easier to scan and now only show the modules that are relevant to the user at the time.
Arm Performance Libraries modules only become available once the appropriate compiler has been loaded, so they do not clutter up the module list until they are actually usable. The modules are also more flexible: the whole directory can be moved to a location of the user's choice without editing the files themselves.
Figure 5: Simplified module system in ACfL 21.0.
Finally, if the more modern Lmod is available on the system, the new modules take advantage of it and use the "family" declaration. This prevents different compiler modules from being loaded at the same time and allows the libraries to be swapped dynamically when the compilers are swapped.
Arm Allinea Studio 21.0 is full of new features for optimizing HPC workloads on Arm, adding SPE profiling for fine-grained performance analysis and new compiler and library updates for improved performance of your HPC applications. Try it now.
Download and installation instructions for Arm Allinea Studio: https://developer.arm.com/tools-and-software/server-and-hpc/arm-allinea-studio/installation/single-page.