Arm Allinea Studio 19.2 is now available. This new major release includes valuable updates to the Arm Performance Libraries (Arm PL) and the Arm Compiler for Linux. It includes our first attempt at the Arm Opt Report, and introduces performance improvements and ML half-precision interfaces to matrix-matrix multiplications and FFTs in Arm PL. In addition, string handling optimizations are provided in a new library (libastring), which the compiler uses by default.
Arm Opt Report is a new, beta-quality feature of Arm Compiler for Linux 19.2 that builds upon the llvm-opt-report tool found in open-source LLVM. The new Arm Opt Report feature makes it easier to see what optimization decisions the compiler is making, in-line with user source code, answering questions such as:
Note:
SVE scaling factor = <true SVE vector width> / 128
What was the interleave count? Interleaving is a combination of vectorization followed by unrolling; multiple streams of vector instructions are performed in each iteration of the loop.
This information, in combination, lets you know how many iterations of the original scalar loop are performed in each iteration of the generated code.
Number of scalar iterations = <unroll factor> x <vectorization factor> x <interleave count> x <SVE scaling factor>
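As a rough illustration of the formula above, the following C sketch multiplies the four factors together. The function and parameter names are hypothetical and are not part of Arm Opt Report. The sketch assumes that a scaling factor of 1 is used when the code is not built for SVE.

```c
#include <assert.h>

/* Hypothetical helper: SVE scaling factor = <true SVE vector width> / 128.
   SVE vector widths are multiples of 128 bits, from 128 up to 2048. */
unsigned sve_scaling_factor(unsigned sve_vector_bits)
{
    assert(sve_vector_bits >= 128 && sve_vector_bits <= 2048);
    assert(sve_vector_bits % 128 == 0);
    return sve_vector_bits / 128;
}

/* Hypothetical helper: scalar iterations covered by one iteration of the
   generated loop. Pass sve_scale = 1 for non-SVE code (an assumption). */
unsigned scalar_iterations(unsigned unroll_factor,
                           unsigned vectorization_factor,
                           unsigned interleave_count,
                           unsigned sve_scale)
{
    return unroll_factor * vectorization_factor * interleave_count * sve_scale;
}
```

For instance, a loop with vectorization factor 4, interleave count 1, no further unrolling, and a scaling factor of 1 gives scalar_iterations(1, 4, 1, 1), that is, 4 scalar iterations per generated iteration.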
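Arm Opt Report works with three pieces: the -fsave-optimization-record compiler option, the opt.yaml optimization record that the option produces, and the arm-opt-report tool that displays the record alongside your source code.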
As an example, we build the following source code:
void bar();
void foo() { bar(); }

void Test(int *res, int *c, int *d, int *p, int n) {
  int i;

#pragma clang loop vectorize(assume_safety)
  for (i = 0; i < 1600; i++) {
    res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }

  for (i = 0; i < 16; i++) {
    res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }

  foo();

  foo(); bar(); foo();
}
First, we compile this source file to an object file:
$ armclang -O3 -fsave-optimization-record or.c -c -o or.o
This generates a file, or.opt.yaml, in the same directory as the built object. For compilations that create multiple object files, there is a report for each build object.
The or.opt.yaml file can be viewed with arm-opt-report:
$ arm-opt-report or.opt.yaml < or.c
 1          | void bar();
 2          | void foo() { bar(); }
 3          |
 4          | void Test(int *res, int *c, int *d, int *p, int n) {
 5          |   int i;
 6          |
 7          | #pragma clang loop vectorize(assume_safety)
 8     V4,1 |   for (i = 0; i < 1600; i++) {
 9          |     res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
10          |   }
11          |
12  U16     |   for (i = 0; i < 16; i++) {
13          |     res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
14          |   }
15          |
16 I        |   foo();
17          |
18          |   foo(); bar(); foo();
   I        |   ^
   I        |                 ^
19          | }
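This can be interpreted as follows:
The for loop on line 8 was vectorized (V), with a vectorization factor of 4 and an interleave count of 1.
The for loop on line 12 was unrolled (U), with an unroll factor of 16.
The calls to foo() on lines 16 and 18 were inlined (I). Where a line contains several calls, a caret marks each call that was inlined.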
The 19.2 release added experimental scheduler improvements that can give performance benefits on large processors, such as ThunderX2. By default, the scheduler improvements are disabled. To enable them, include the "-mllvm -misched-favour-latency=true" option at compile time. We welcome feedback on your experience with this option, which might become the default scheduler in future releases.
Libamath and libastring provide efficient scalar and vector versions of some common math and memory/string library functions, for both C/C++ and Fortran workloads.
By default, libamath/libastring scalar routines are used rather than those provided by your operating system libraries. We believe the libamath versions should be more efficient in all cases.
libamath vector routines are used by the compiler during auto-vectorization. When the compiler encounters a scalar math routine (or Fortran math intrinsic) in a loop, it can replace this call with the vector version, allowing auto-vectorization to proceed. This functionality can be enabled by adding the "-armpl" option.
See the libraries section below for more details of the routines provided by libamath and libastring.
The Fortran 2008 ERROR STOP statement is now supported.
A new flag -fno-realloc-lhs has been added, for consistency with GNU compilers. Use -fno-realloc-lhs in place of -Mallocatable=95, which is no longer documented, but is still supported. Refer to the Fortran Reference Guide for information about this flag.
The environment variable ARM_LICENCE_DIR can now be used to set the license search path. This environment variable is consistent with Arm Forge. The old environment variable (ARM_HPC_COMPILER_LICENSE_SEARCH_PATH) is still supported but is not recommended.
Since the last blog post we have been working on a variety of performance improvements in Arm Performance Libraries. The most significant gains should be seen in our FFT routines, especially for single precision data and also for powers of 2 transform lengths in either single or double precision. Below we highlight some of the other major changes that have been made in the 19.2 release.
Half precision support
In the 19.2 release we have introduced half precision library routines for the first time. There is a complete set of half precision FFT routines in C, matching the functionality available in single and double precision, and we also have a half precision matrix-matrix multiplication routine which matches the standard L3 BLAS *GEMM interfaces. The latter may be accessed in C via the hgemm_() function.
A call to hgemm_ uses the __fp16 half precision type, for example:
__fp16 *A, *B, *C;
__fp16 alpha, beta;
…
hgemm_(&transa, &transb, &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc);
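For readers less familiar with the BLAS conventions, the sketch below shows the operation that a *GEMM routine computes, C = alpha*A*B + beta*C on column-major matrices, for the no-transpose case only. It is a naive reference loop written in float so that it compiles on any platform. The name ref_gemm_nn is hypothetical, and this is not how Arm PL implements hgemm_.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference: C = alpha*A*B + beta*C, column-major, no transposes.
   A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc).
   Hypothetical illustration only; not the Arm PL implementation. */
void ref_gemm_nn(int m, int n, int k, float alpha,
                 const float *A, int lda,
                 const float *B, int ldb,
                 float beta, float *C, int ldc)
{
    /* Leading dimensions must cover the rows of each matrix */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; j++) {           /* columns of C */
        for (int i = 0; i < m; i++) {       /* rows of C */
            float acc = 0.0f;
            for (int p = 0; p < k; p++)     /* dot product of row i, column j */
                acc += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            C[i + (size_t)j * ldc] = alpha * acc + beta * C[i + (size_t)j * ldc];
        }
    }
}
```

The leading dimensions (lda, ldb, ldc) play the same role as in the hgemm_ call above: they give the stride between consecutive columns in memory.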
The naming scheme for the FFTW interfaces has been extended so that all half precision routines are prefixed fftwh_. For example:
/* Include Arm Performance Libraries FFT interface.
   Make sure you include the header file provided by Arm PL
   and not the header provided by FFTW3. */
#include "fftw3.h"

/* Declare half-precision arrays to be used */
__fp16 *in;
fftwh_complex *out;
fftwh_plan plan;

/* Plan, execute and destroy */
plan = fftwh_plan_many_dft_r2c(...);
fftwh_execute(plan);
fftwh_destroy_plan(plan);
These functions are available for all CPU targets. On machines where native half precision instructions are available (in other words, for Armv8.2 capable CPUs and above) those instructions will be utilized in order to provide a performance improvement. On machines where native half precision instructions are not available, computation will be performed in single precision; the library will take care of converting the input and output data as necessary.
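To give a feel for the kind of conversion involved when native half precision instructions are not available, here is a minimal portable sketch that widens IEEE 754 half precision bit patterns to float. It is an illustration only: the function names are hypothetical, the half values are assumed to be stored as raw uint16_t bit patterns, and Arm PL's internal conversion code is not shown in this post and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen one IEEE 754 binary16 bit pattern to float. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;   /* 5-bit exponent, bias 15 */
    uint32_t man  = h & 0x3FFu;          /* 10-bit mantissa */
    uint32_t bits;
    float f;

    if (exp == 0 && man == 0) {
        bits = sign;                                  /* signed zero */
    } else if (exp == 0) {
        /* Subnormal half: shift until the implicit leading 1 appears */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0) { man <<= 1; shift++; }
        man &= 0x3FFu;
        bits = sign | ((127u - 15u + 1u - shift) << 23) | (man << 13);
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* infinity or NaN */
    } else {
        bits = sign | ((exp + 127u - 15u) << 23) | (man << 13);
    }
    memcpy(&f, &bits, sizeof f);   /* bit cast without aliasing issues */
    return f;
}

/* Hypothetical helper: widen an input array before single precision work. */
void half_array_to_float(const uint16_t *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = half_bits_to_float(in[i]);
}
```

Every half precision value is exactly representable in single precision, so this widening step loses no information; rounding only occurs when results are narrowed back to half precision.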
Improved libamath vector functions
We have added new implementations of a number of vector math functions in libamath, which provide some significant performance improvements. The following results illustrate the improvements observed on a Marvell ThunderX2 system when running the Elefunt benchmark where the time taken has been normalized to the result achieved using GCC 8.2 with the standard libm (in other words, a value greater than 1 indicates faster performance than with GCC).
libastring: A new library of string functions
libastring is a new library in version 19.2 that contains optimized versions of some of the string functions found in glibc. The performance improvements seen for memcpy and memset, compared with the same functions in the default glibc library, are given below:
As with libamath, the libastring library is linked into your application automatically by the Arm Compiler, so you only need to re-link your application to benefit from the performance improvements provided by this library.
Our technical authors have put together two comprehensive porting guides to assist developers migrating to Arm. These guides are readily available online and include great tips and tricks to extract the last drop of performance from your Arm hardware.
If you have questions or doubts, or want to raise an issue, either email HPC software support or visit the support page. The vast majority of requests are answered within a single working day. The HPC ecosystem pages also have valuable information to get you started on Arm.
I am excited to announce the availability of Arm Allinea Studio 19.2 with major enhancements to compiler and libraries. Please get in touch to request a trial or buy a license. We plan to provide the next major release 19.3 towards the end of August 2019, with more features and improvements.