The Arm Allinea Studio 20.0 release adds several headline features and improvements to the Arm commercial toolchain for Linux servers. Highlights of this release include:
Arm Optimization Report is a new feature that makes it easier to see the optimization decisions the compiler is making, inline with your source code. For documentation on how to use Arm Optimization Report, see the report here.
Arm Compiler for Linux 20.0 upgrades Arm Optimization Report to a fully supported feature, with some significant new functionality. The following examples show some of this functionality and how to interpret its output.
Consider the following piece of code, which has many pointer arguments:
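The original code listing is not reproduced here; a minimal sketch of the kind of loop described, with several pointer arguments the compiler must assume may alias, might look like this (function and variable names are illustrative):

```c
/* A loop with many pointer arguments: without aliasing information,
   the compiler must assume a, b, c, and d may overlap. */
void scale_add(double *a, double *b, double *c, double *d, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
}
```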
The compiler must check that these pointers do not overlap, which has a run-time cost. If too many pointers might overlap (as in this example), the compiler chooses not to vectorize. However, if you add restrict (or __restrict__) around pointers you know will never overlap, this loop is vectorized.
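A restrict-qualified variant of the same hypothetical loop, promising the compiler that the pointers never overlap, might look like this:

```c
/* restrict tells the compiler these pointers never overlap, removing
   the run-time overlap checks and enabling vectorization. */
void scale_add_restrict(double *restrict a, const double *restrict b,
                        const double *restrict c, const double *restrict d,
                        int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
}
```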
Arm Optimization Report can now show this guidance, as follows:
In the following example, Arm Optimization Report highlights that a loop contains a scalar function inhibiting vectorization. For brevity, the example forces this using the noinline attribute, but in the real world it might be because the function in question contains many operations.
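The example in question is not reproduced here; a minimal sketch of the pattern described, with a scalar call kept out-of-line using the noinline attribute (names are illustrative), is:

```c
/* noinline forces inc() to stay a scalar call, which inhibits
   vectorization of the loop in bump(). */
__attribute__((noinline))
static int inc(int x) { return x + 1; }

void bump(int *v, int n) {
    for (int i = 0; i < n; i++)
        v[i] = inc(v[i]);   /* scalar function call in the loop body */
}
```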
If you modify inc() so that it is inlined (for example, using the always_inline attribute), the code will vectorize.
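A sketch of the modified version, with the call guaranteed to be folded into the loop body, might be:

```c
/* always_inline ensures inc() is folded into the loop body, so no
   scalar call remains and the loop can vectorize. */
__attribute__((always_inline))
static inline int inc(int x) { return x + 1; }

void bump(int *v, int n) {
    for (int i = 0; i < n; i++)
        v[i] = inc(v[i]);
}
```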
It is possible to use pragmas to encourage vectorization, as in this example:
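As an illustration (not the original example), a Clang-style loop pragma can request vectorization; armclang is Clang-based, and other compilers simply ignore pragmas they do not recognize:

```c
/* Request vectorization of the loop, overriding the compiler's own
   profitability judgement. */
void saxpy(float *y, const float *x, float a, int n) {
#pragma clang loop vectorize(enable)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```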
However, for some CPUs it might not be beneficial to vectorize this case. Arm Compiler for Linux can detect this, which allows Arm Optimization Report to advise you that vectorization might not be beneficial:
In this example, the function foo() contains a loop with a variable number of iterations. foo() is inlined into each caller, and in each case (foo itself, foo_4, and foo_8) the compiler behaves differently. Arm Optimization Report can show these differences, as follows:
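The original listing is not reproduced here; a hypothetical reconstruction of the pattern described (a loop whose trip count is only known after inlining, and whose body calls a user function) is:

```c
__attribute__((noinline))
int bar(int x) { return x * x; }   /* hypothetical user function */

static inline int foo(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += bar(v[i]);            /* user-function call: no vectorization */
    return s;
}

/* After inlining, each caller has a known, constant trip count, so the
   compiler can fully unroll the loop. */
int foo_4(const int *v) { return foo(v, 4); }
int foo_8(const int *v) { return foo(v, 8); }
```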
You can see that in foo_4 the loop is fully unrolled with four iterations, and in foo_8 the loop is fully unrolled with eight iterations. In the general function foo, the loop is not unrolled, nor is it vectorized, because it contains a call to a user function.
The previous section highlighted some examples where a loop could not be vectorized because of a scalar function call in the loop. In some cases, such calls can be inlined and then vectorized, but this is not always possible.
To vectorize such loops in the general case, the compiler needs a compatible vector variant of any scalar function that is called within a loop. During autovectorization, the compiler replaces the scalar function call with the correct vector variant, allowing autovectorization to proceed.
If you add -fsimdmath to your command line, Arm Compiler for Linux does this automatically for some of the library functions in math.h and string.h. Suitable vector variants for these functions are shipped as part of the product.
Arm Compiler for Linux 20.0 adds a new feature that allows you to provide your own vectorized variants of user-written functions. These can be implemented using the Arm C Language Extensions (ACLE).
The ABI that defines how Arm Compiler for Linux calls these functions is fully open source with a permissive license. All user code uses standard OpenMP 5.0 pragmas, allowing interoperability between compilers and architectures. To enable this functionality in open-source Clang, Arm is actively working with the Open Source LLVM community.
For more information about how to write vector variants of your scalar functions, see the vector documentation here.
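As a sketch of the standard-pragma side of this mechanism, OpenMP's declare simd directive tells the compiler a vector variant of a scalar function should exist; the 20.0 feature additionally lets you supply the variant's body yourself using ACLE intrinsics via OpenMP 5.0 pragmas (an ACLE body is omitted here for portability; names are illustrative, and the pragmas are harmlessly ignored when OpenMP is disabled):

```c
/* Ask the compiler to make a vector variant of inc_d() callable from
   vectorized loops. With Arm Compiler for Linux 20.0, a hand-written
   ACLE implementation of that variant could be supplied instead. */
#pragma omp declare simd notinbranch
double inc_d(double x) { return x + 1.0; }

void bump_all(double *v, int n) {
#pragma omp simd
    for (int i = 0; i < n; i++)
        v[i] = inc_d(v[i]);   /* replaced by the vector variant when vectorized */
}
```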
Arm Compiler for Linux adds support for systems based on the new Arm Neoverse N1 processor. To enable tuning and libraries for Neoverse N1-based systems, compile with -mcpu=neoverse-n1 -armpl or -mcpu=native -armpl.
Several 'under the hood' improvements to vectorization are enabled in the 20.0 release. These improvements include:
In addition, the LLVM version on which Arm Compiler for Linux is based has been upgraded to 9.0, and the GNU compiler that is shipped with Arm Compiler for Linux has been upgraded to 9.2. These upgrades bring a year of optimizations and bug fixes from each compiler's open-source community into the package.
Arm Compiler for Linux 20.0 has several improvements to help system administrators customize the package for their users' needs.
Arm Compiler for Linux has variants of Arm Performance Libraries for several supported CPUs. However, some users do not need this flexibility, and would rather save installed disk space by only installing support for a subset of these variants.
The installer in Arm Compiler for Linux allows a subset of variants to be installed with the new --only-install-microarchitectures option.
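For example, a site that only runs Neoverse N1 hardware might install just the matching library variants. The installer filename and variant name below are illustrative; check the installer's help output for the accepted values:

```shell
./arm-compiler-for-linux-installer.sh \
    --only-install-microarchitectures=neoverse-n1
```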
For more information, see the Arm Allinea Studio package.
Arm Compiler for Linux 20.0 adds support for a new compiler configuration file. Using this file, administrators or users can specify a default set of configuration flags for all users of the compiler. For example, it might be appropriate to always tune for the host CPU, so -mcpu=native could be added by default. User-supplied options will always override these defaults.
An example configuration file is provided with a number of example configurations, and the location of the configuration file can be customized using the package module files.
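A configuration file in this style might contain nothing more than a list of default options, one per line. The contents below are a hypothetical illustration, not the shipped example:

```
# Always tune for the CPU of the machine doing the build;
# options supplied by the user on the command line take precedence.
-mcpu=native
```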
Full documentation of this feature is provided here: how to configure Arm Allinea Studio.
Arm Compiler for Linux now supports tab-completion of options and arguments in bash. For instructions on enabling this feature, see how to install Arm Allinea Studio.
BSR has been introduced as a sparse matrix input format for both Sparse Matrix-Vector multiplication (SpMV) and Sparse Matrix-Matrix multiplication (SpMM). The new armpl_spmat_create_[sdcz] and armpl_spmat_export_[sdcz] functions are available in both C and Fortran, and are documented in the Arm Performance Libraries v20.0 reference guide. The reference guide also describes the new BSR format alongside our other supported input formats. For reference, the CSR and BSR formats are similar, except that in BSR each entry in the array of column indices represents a dense square block of nonzero values, rather than a single scalar value.
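To make the format concrete, the following sketch (plain C arrays with illustrative names, not the Arm PL API) stores a 4x4 matrix built from 2x2 dense blocks in BSR form and applies it with a reference SpMV:

```c
/* BSR storage for the 4x4 matrix below, made of 2x2 dense blocks
   (two block-rows, two block-columns):
       [ 1 2 | 0 0 ]
       [ 3 4 | 0 0 ]
       [-----+-----]
       [ 0 0 | 5 6 ]
       [ 0 0 | 7 8 ]
   As in CSR, row_ptr and col_indx index rows and columns, but at block
   granularity; vals holds each nonzero block's elements contiguously. */
#define BLOCK 2
static const int    row_ptr[]  = {0, 1, 2};   /* block-row i spans row_ptr[i]..row_ptr[i+1] */
static const int    col_indx[] = {0, 1};      /* block-column of each nonzero block */
static const double vals[]     = {1, 2, 3, 4, /* block (0,0), row-major */
                                  5, 6, 7, 8};/* block (1,1), row-major */

/* Reference y = A*x using the BSR arrays above. */
static void bsr_spmv(const double *x, double *y) {
    for (int bi = 0; bi < 2; bi++)
        for (int r = 0; r < BLOCK; r++) {
            double s = 0.0;
            for (int k = row_ptr[bi]; k < row_ptr[bi + 1]; k++) {
                int bj = col_indx[k];
                for (int c = 0; c < BLOCK; c++)
                    s += vals[k * BLOCK * BLOCK + r * BLOCK + c] * x[bj * BLOCK + c];
            }
            y[bi * BLOCK + r] = s;
        }
}
```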
A tuned implementation for SpMV with BSR matrices offers significant performance gains over using the Compressed Sparse Row (CSR) format. For example, the four matrices in the chart below have a 3x3 blocked structure, and therefore are efficiently represented as BSR matrices. Using BSR with Arm PL 20.0 provides up to a doubling in performance over using the CSR format with Arm PL 19.3.
Thank you to Mohammad Zubair, Old Dominion University, for collaborating with us on enhancing the performance of SpMV using BSR.
The sparse routines for matrix-matrix multiplication, which were introduced to Arm PL in 19.3, have been improved to use Gustavson's well-known algorithm for SpMM [1]. The effect of this is to provide a performant routine for the first time, with a scalable, parallel implementation to come in a future release. The following chart shows how Arm PL SpMM now compares with an implementation in another leading numerical library for a range of matrices available in the Sparse Matrix Collection (at https://sparse.tamu.edu/):
In the chart, blue bars indicate the problems where Arm PL provides a speedup. Yellow bars indicate where Arm PL is less than 10% slower. Red bars indicate where Arm PL is more than 10% slower.
[1] Fred G. Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical Software (TOMS), 4(3):250-269, 1978.
The double-precision, dense matrix solve routine, DTRSM, has been optimized to provide performance improvements across a wide range of problem sizes, and the improvements scale with the number of threads used. For example, the following graph shows the improvements with 8 and 28 threads:
Performance of the dense symmetric rank-k and rank-2k update routines has similarly been improved, as shown in the following graph. In this case, the improvements apply to all datatypes and to both routine types, [SDCZ]SYRK and [SDCZ]SYR2K, all of which benefit from the optimizations illustrated here for the double-precision rank-k case.
Improved scalar implementations have been introduced into libamath for the replacement libm functions erf (double-precision error function) and erfc (double-precision complementary error function). For key intervals in the input domain, the performance gains are illustrated in the following graph:
The most important algorithmic improvements have been made in the interval [1.25, 6], which is visible in the previous results as the significant performance improvements at input samples 2.5 and 4. Since erf is an odd function, the same performance is reflected for negative values. Similar gains are seen in erfc, this time with the biggest gains in the intervals [-6, -1.25] and [1.25, 28].
You will also see faster performance in log10 (double-precision log10) and log10f (single-precision log10) in both scalar and vector implementations. The greatest speedup comes from the scalar improvements, but codes linked to the vector routines should also see big improvements.
Documentation
We have released version 2.0 of our Porting and Tuning HPC Applications for Arm and Porting and Tuning HPC Applications for Arm SVE guides to provide you with the latest information about porting your codes to Arm. Content updates include new information about:
Both porting guides are now also shipped as part of the Arm Compiler for Linux product in an offline-accessible HTML format in <install-package>/share/doc.
If you have questions or want to raise an issue, you can do so by emailing the HPC software support team or by visiting the support page. Most requests are answered within one working day. The HPC ecosystem pages also have valuable information to get you started on Arm-based servers.
I am excited to announce the availability of Arm Allinea Studio 20.0 with major enhancements to our Linux compiler and our optimized mathematical libraries. Our next major version of the Arm Compiler for Linux is expected towards the end of March 2020, and will include major performance improvements for SVE-based microarchitectures. We expect this will give the HPC community plenty of time to update benchmark numbers in preparation for ISC'20 in Frankfurt.
[CTAToken URL="https://developer.arm.com/tools-and-software/server-and-hpc/arm-architecture-tools/arm-allinea-studio/download" target="_blank" text="Download Arm Allinea Studio 20.0" class="green"]