The Arm Allinea Studio 20.0 release adds several headline features and improvements to the Arm commercial toolchain for Linux servers. Highlights of this release include:
Arm Optimization Report is a new feature that makes it easier to see the optimization decisions the compiler is making, inline with your source code. For documentation on how to use Arm Optimization Report, see the report here.
Arm Compiler for Linux 20.0 upgrades Arm Optimization Report to a fully supported feature, with some significant new functionality. The following examples show some of this functionality and how to interpret its output.
Consider the following piece of code, which has many pointer arguments:
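The original code listing is not reproduced here; a minimal sketch of the kind of loop described, with several pointer arguments the compiler must assume may alias, might look like this (function and variable names are illustrative):

```c
/* A loop with many pointer arguments: without aliasing information,
   the compiler must assume a, b, c, and d may overlap. */
void scale_add(double *a, double *b, double *c, double *d, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
}
```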
The compiler must check that these pointers do not overlap, which has a run-time cost. If too many pointers might overlap (as in this example), the compiler chooses not to vectorize. However, if you add restrict (or __restrict__) around pointers you know will never overlap, this loop is vectorized.
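A restrict-qualified variant of the same hypothetical loop, promising the compiler that the pointers never overlap, might look like this:

```c
/* restrict tells the compiler these pointers never overlap, removing
   the run-time overlap checks and enabling vectorization. */
void scale_add_restrict(double *restrict a, const double *restrict b,
                        const double *restrict c, const double *restrict d,
                        int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
}
```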
Arm Optimization Report can now show this guidance, as follows:
In the following example, Arm Optimization Report highlights that a loop contains a scalar function inhibiting vectorization. For brevity, the example forces this using the noinline attribute, but in the real world it might be because the function in question contains many operations.
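The example in question is not reproduced here; a minimal sketch of the pattern described, with a scalar call kept out-of-line using the noinline attribute (names are illustrative), is:

```c
/* noinline forces inc() to stay a scalar call, which inhibits
   vectorization of the loop in bump(). */
__attribute__((noinline))
static int inc(int x) { return x + 1; }

void bump(int *v, int n) {
    for (int i = 0; i < n; i++)
        v[i] = inc(v[i]);   /* scalar function call in the loop body */
}
```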
If you modify inc() so that it is inlined (for example, using the always_inline attribute), the code will vectorize.
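A sketch of the modified version, with the call guaranteed to be folded into the loop body, might be:

```c
/* always_inline ensures inc() is folded into the loop body, so no
   scalar call remains and the loop can vectorize. */
__attribute__((always_inline))
static inline int inc(int x) { return x + 1; }

void bump(int *v, int n) {
    for (int i = 0; i < n; i++)
        v[i] = inc(v[i]);
}
```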
It is possible to use pragmas to encourage vectorization, as in this example:
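As an illustration (not the original example), a Clang-style loop pragma can request vectorization; armclang is Clang-based, and other compilers simply ignore pragmas they do not recognize:

```c
/* Request vectorization of the loop, overriding the compiler's own
   profitability judgement. */
void saxpy(float *y, const float *x, float a, int n) {
#pragma clang loop vectorize(enable)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```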
However, for some CPUs it might not be beneficial to vectorize this case. Arm Compiler for Linux can detect this, which allows Arm Optimization Report to advise you that vectorization might not be beneficial:
In this example, the function foo() contains a loop with a variable number of iterations. foo() is inlined into each caller, and in each case (foo itself, foo_4, and foo_8) the compiler behaves differently. Arm Optimization Report can show these differences, as follows:
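The original listing is not reproduced here; a hypothetical reconstruction of the pattern described (a loop whose trip count is only known after inlining, and whose body calls a user function) is:

```c
__attribute__((noinline))
int bar(int x) { return x * x; }   /* hypothetical user function */

static inline int foo(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += bar(v[i]);            /* user-function call: no vectorization */
    return s;
}

/* After inlining, each caller has a known, constant trip count, so the
   compiler can fully unroll the loop. */
int foo_4(const int *v) { return foo(v, 4); }
int foo_8(const int *v) { return foo(v, 8); }
```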
You can see that in foo_4 the loop is fully unrolled with four iterations, and in foo_8 the loop is fully unrolled with eight iterations. In the general function foo, the loop is not unrolled, nor is it vectorized, because it contains a call to a user function.
The previous section highlighted some examples where a loop could not be vectorized because of a scalar function call in the loop. In some cases, such calls can be inlined and then vectorized, but this is not always possible.
To vectorize such loops in the general case, the compiler needs a compatible vector variant of any scalar function that is called within a loop. During autovectorization, the compiler replaces the scalar function call with the correct vector variant, allowing autovectorization to proceed.
If you add -fsimdmath to your command line, Arm Compiler for Linux does this automatically for some of the library functions in math.h and string.h. Suitable vector variants for these functions are shipped as part of the product.
Arm Compiler for Linux 20.0 adds a new feature that allows you to provide your own vectorized variants of user-written functions. These can be implemented using the Arm C Language Extensions (ACLE).
The ABI that defines how Arm Compiler for Linux calls these functions is fully open source with a permissive license. All user code uses standard OpenMP 5.0 pragmas, allowing interoperability between compilers and architectures. To enable this functionality in open-source Clang, Arm is actively working with the Open Source LLVM community.
For more information about how to write vector variants of your scalar functions, see the vector documentation here.
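As a sketch of the standard-pragma side of this mechanism, OpenMP's declare simd directive tells the compiler a vector variant of a scalar function should exist; the 20.0 feature additionally lets you supply the variant's body yourself using ACLE intrinsics via OpenMP 5.0 pragmas (an ACLE body is omitted here for portability; names are illustrative, and the pragmas are harmlessly ignored when OpenMP is disabled):

```c
/* Ask the compiler to make a vector variant of inc_d() callable from
   vectorized loops. With Arm Compiler for Linux 20.0, a hand-written
   ACLE implementation of that variant could be supplied instead. */
#pragma omp declare simd notinbranch
double inc_d(double x) { return x + 1.0; }

void bump_all(double *v, int n) {
#pragma omp simd
    for (int i = 0; i < n; i++)
        v[i] = inc_d(v[i]);   /* replaced by the vector variant when vectorized */
}
```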
Arm Compiler for Linux adds support for systems based on the new Arm Neoverse N1 processor. To enable tuning and libraries for Neoverse N1-based systems, compile with -mcpu=neoverse-n1 -armpl or -mcpu=native -armpl.
Several 'under the hood' improvements to vectorization are enabled in the 20.0 release. These improvements include:
In addition, the LLVM version on which Arm Compiler for Linux is based has been upgraded to 9.0, and the GNU compiler that is shipped with Arm Compiler for Linux has been upgraded to 9.2. These upgrades bring a year of optimizations and bug fixes from each compiler's open-source community into the package.
Arm Compiler for Linux 20.0 has several improvements to help system administrators customize the package for their users' needs.
Arm Compiler for Linux has variants of Arm Performance Libraries for several supported CPUs. However, some users do not need this flexibility, and would rather save installed disk space by only installing support for a subset of these variants.
The installer in Arm Compiler for Linux allows a subset of variants to be installed with the new --only-install-microarchitectures option.
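For example, a site that only runs Neoverse N1 hardware might install just the matching library variants. The installer filename and variant name below are illustrative; check the installer's help output for the accepted values:

```shell
./arm-compiler-for-linux-installer.sh \
    --only-install-microarchitectures=neoverse-n1
```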
For more information, see the Arm Allinea Studio package.
Arm Compiler for Linux 20.0 adds support for a new compiler configuration file. Using this file, administrators or users can specify a default set of configuration flags for all users of the compiler. For example, it might be appropriate to always tune for the host CPU, so -mcpu=native could be added by default. User-supplied options will always override these defaults.
An example configuration file is provided with a number of example configurations, and the location of the configuration file can be customized using the package module files.
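A configuration file in this style might contain nothing more than a list of default options, one per line. The contents below are a hypothetical illustration, not the shipped example:

```
# Always tune for the CPU of the machine doing the build;
# options supplied by the user on the command line take precedence.
-mcpu=native
```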
Full documentation of this feature is provided here: how to configure Arm Allinea Studio.
Arm Compiler for Linux now supports tab-completion of options and arguments in bash. For instructions on enabling this feature, see how to install Arm Allinea Studio.
BSR has been introduced as a sparse matrix input format for both Sparse Matrix-Vector multiplication (SpMV) and Sparse Matrix-Matrix multiplication (SpMM). The new armpl_spmat_create_[sdcz] and armpl_spmat_export_[sdcz] functions are available in both C and Fortran, and are documented in the Arm Performance Libraries v20.0 reference guide. The reference guide also describes the new BSR format alongside our other supported input formats. For reference, the CSR and BSR formats are similar, except that in BSR each entry in the array of column indices represents a dense square block of nonzero values, rather than a single scalar value.
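To make the format concrete, the following sketch (plain C arrays with illustrative names, not the Arm PL API) stores a 4x4 matrix built from 2x2 dense blocks in BSR form and applies it with a reference SpMV:

```c
/* BSR storage for the 4x4 matrix below, made of 2x2 dense blocks
   (two block-rows, two block-columns):
       [ 1 2 | 0 0 ]
       [ 3 4 | 0 0 ]
       [-----+-----]
       [ 0 0 | 5 6 ]
       [ 0 0 | 7 8 ]
   As in CSR, row_ptr and col_indx index rows and columns, but at block
   granularity; vals holds each nonzero block's elements contiguously. */
#define BLOCK 2
static const int    row_ptr[]  = {0, 1, 2};   /* block-row i spans row_ptr[i]..row_ptr[i+1] */
static const int    col_indx[] = {0, 1};      /* block-column of each nonzero block */
static const double vals[]     = {1, 2, 3, 4, /* block (0,0), row-major */
                                  5, 6, 7, 8};/* block (1,1), row-major */

/* Reference y = A*x using the BSR arrays above. */
static void bsr_spmv(const double *x, double *y) {
    for (int bi = 0; bi < 2; bi++)
        for (int r = 0; r < BLOCK; r++) {
            double s = 0.0;
            for (int k = row_ptr[bi]; k < row_ptr[bi + 1]; k++) {
                int bj = col_indx[k];
                for (int c = 0; c < BLOCK; c++)
                    s += vals[k * BLOCK * BLOCK + r * BLOCK + c] * x[bj * BLOCK + c];
            }
            y[bi * BLOCK + r] = s;
        }
}
```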
A tuned implementation for SpMV with BSR matrices offers significant performance gains over using the Compressed Sparse Row (CSR) format. For example, the four matrices in the chart below have a 3x3 blocked structure, and therefore are efficiently represented as BSR matrices. Using BSR with Arm PL 20.0 provides up to a doubling in performance over using the CSR format with Arm PL 19.3.
Thank you to Mohammad Zubair, Old Dominion University, for collaborating with us on enhancing the performance of SpMV using BSR.
The sparse routines for matrix-matrix multiplication, which were introduced to Arm PL in 19.3, have been improved to use Gustavson's well-known algorithm for SpMM [1]. The effect of this is to provide a performant routine for the first time, with a scalable, parallel implementation to come in a future release. The following chart shows how Arm PL SpMM now compares with an implementation in another leading numerical library for a range of matrices available in the Sparse Matrix Collection (at https://sparse.tamu.edu/):
In the chart, blue bars indicate the problems where Arm PL provides a speedup. Yellow bars indicate where Arm PL is less than 10% slower. Red bars indicate where Arm PL is more than 10% slower.
[1] Fred G. Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical Software (TOMS), 4(3):250-269, 1978.
The double-precision, dense matrix solve routine, DTRSM, has been optimized to provide performance improvements across a wide range of problem sizes, and the improvements scale with the number of threads used. For example, the following graph shows the improvements with 8 and 28 threads:
Performance of the dense symmetric rank-k and rank-2k update routines has similarly been improved, as shown in the following graph. In this case, the improvements apply to all datatypes and to both routine types, [SDCZ]SYRK and [SDCZ]SYR2K, all of which benefit from the optimizations illustrated here for the double-precision rank-k case.
Improved scalar implementations have been introduced into libamath for the replacement libm functions erf (double-precision error function) and erfc (double-precision complementary error function). For key intervals in the input domain, the performance gains are illustrated in the following graph:
The most important algorithmic improvements have been made in the interval [1.25, 6], which is visible in the previous results as the significant performance improvements at input samples 2.5 and 4. Since erf is an odd function, the same performance is reflected for negative values. Similar gains are seen in erfc, this time with the biggest gains in the intervals [-6, -1.25] and [1.25, 28].
You will also see faster performance in log10 (double-precision log10) and log10f (single-precision log10) in both scalar and vector implementations. The greatest speedup comes from the scalar improvements, but codes linked to the vector routines should also see big improvements.
Documentation
We have released version 2.0 of our Porting and Tuning HPC Applications for Arm and Porting and Tuning HPC Applications for Arm SVE guides to provide you with the latest information about porting your codes to Arm. Content updates include new information about:
Both porting guides are now also shipped as part of the Arm Compiler for Linux product in an offline-accessible HTML format in <install-package>/share/doc.
If you have questions or want to raise an issue, you can do so by emailing the HPC software support team or by visiting the support page. Most requests are answered within one working day. The HPC ecosystem pages also have valuable information to get you started on Arm-based servers.
I am excited to announce the availability of Arm Allinea Studio 20.0 with major enhancements to our Linux compiler and our optimized mathematical libraries. Our next major version of the Arm Compiler for Linux is expected towards the end of March 2020, and will include major performance improvements for SVE-based microarchitectures. We expect this will give the HPC community plenty of time to update benchmark numbers in preparation for ISC'20 in Frankfurt.
[CTAToken URL="https://developer.arm.com/tools-and-software/server-and-hpc/arm-architecture-tools/arm-allinea-studio/download" target="_blank" text="Download Arm Allinea Studio 20.0" class="green"]