Arm Allinea Studio 19.2: building on Libraries and Arm Compiler for Linux performance

June 25, 2019

Arm Allinea Studio 19.2 is now available. This major release includes valuable updates to the Arm Performance Libraries (Arm PL) and the Arm Compiler for Linux. It introduces the first, beta-quality version of Arm Opt Report, and adds half-precision interfaces for matrix-matrix multiplication and FFTs in Arm PL, aimed at machine learning (ML) workloads. In addition, optimized string handling routines are provided in a new library (libastring), which the compiler uses by default.

Introduction to Arm Opt Report

Arm Opt Report is a new, beta-quality feature of Arm Compiler for Linux 19.2 that builds upon the llvm-opt-report tool found in open-source LLVM. The new Arm Opt Report feature makes it easier to see what optimization decisions the compiler is making, in-line with user source code, answering questions such as:

Was a loop unrolled? Unrolling is when a scalar loop is transformed to perform multiple iterations at once, but still as scalar instructions.
If so, what was the unroll factor? The unroll factor is the number of iterations of the original loop that are performed at once. Sometimes, loops with known small iteration counts are completely unrolled, such that no loop structure remains. In completely unrolled cases, the unroll factor will be the total scalar iteration count.
Was a loop vectorized? Vectorization is when multiple iterations of a scalar loop are replaced by a single iteration of vector instructions.
What was the vectorization factor? The vectorization factor is the number of lanes in the vector unit, and corresponds to the number of scalar iterations performed by each vector instruction.

Note:

The true vectorization factor is unknown at compile-time for SVE, because SVE supports scalable vectors.
For this reason, when SVE is enabled, Arm Opt Report reports a vectorization factor that corresponds to a 128-bit SVE implementation.
If you are working with an SVE implementation with a larger vector width (for example, 256 or 512 bits), the number of scalar iterations performed by each vector instruction will increase proportionally.

SVE scaling factor = <true SVE vector width> / 128

What was the interleave count?

Interleaving is a combination of vectorization followed by unrolling; multiple streams of vector instructions are performed in each iteration of the loop.

This information, in combination, lets you know how many iterations of the original scalar loop are performed in each iteration of the generated code.

Number of scalar iterations = <unroll factor> x <vectorization factor> x <interleave count> x <SVE scaling factor>
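The two formulas above can be combined into a small helper. The following is a minimal sketch in C; the function names sve_scaling_factor and scalar_iterations are hypothetical and are not part of Arm Opt Report or any other Arm tool. It assumes that a loop which was not unrolled contributes an unroll factor of 1, and that a loop not vectorized with SVE uses a scaling factor of 1.

```c
#include <assert.h>

/* SVE scaling factor = <true SVE vector width> / 128.
 * SVE vector widths are multiples of 128 bits, from 128 up to 2048. */
unsigned sve_scaling_factor(unsigned sve_width_bits) {
  assert(sve_width_bits >= 128 && sve_width_bits <= 2048);
  assert(sve_width_bits % 128 == 0);
  return sve_width_bits / 128;
}

/* Number of scalar iterations of the original loop that are performed
 * by one iteration of the generated loop. */
unsigned scalar_iterations(unsigned unroll_factor,
                           unsigned vectorization_factor,
                           unsigned interleave_count,
                           unsigned sve_scale) {
  return unroll_factor * vectorization_factor * interleave_count * sve_scale;
}
```

For example, a loop reported with a vectorization factor of 4 and an interleave count of 2, with no further unrolling, running on a 512-bit SVE implementation gives scalar_iterations(1, 4, 2, sve_scaling_factor(512)), which is 32 scalar iterations per iteration of the generated loop.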

How to use Arm Opt Report

To generate a machine-readable opt.yaml report, add -fsave-optimization-record to your command line.
Inspect the opt.yaml report, as augmented source code, using the arm-opt-report tool.

As an example, we build the following source code:

void bar();
void foo() { bar(); }

void Test(int *res, int *c, int *d, int *p, int n) {
  int i;

#pragma clang loop vectorize(assume_safety)
  for (i = 0; i < 1600; i++) {
    res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }

  for (i = 0; i < 16; i++) {
    res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }

  foo();

  foo(); bar(); foo();
}

First, we compile this function to an object file:

$ armclang -O3 -fsave-optimization-record or.c -c -o or.o

This generates a file, or.opt.yaml, in the same directory as the built object. For compilations that create multiple object files, a report is generated for each object file.

The or.opt.yaml file can be viewed with arm-opt-report:

$ arm-opt-report or.opt.yaml
< or.c
 1          | void bar();
 2          | void foo() { bar(); }
 3          |
 4          | void Test(int *res, int *c, int *d, int *p, int n) {
 5          |   int i;
 6          |
 7          | #pragma clang loop vectorize(assume_safety)
 8     V4,1 |   for (i = 0; i < 1600; i++) {
 9          |     res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
10          |   }
11          |
12  U16     |   for (i = 0; i < 16; i++) {
13          |     res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
14          |   }
15          |
16 I        |   foo();
17          |
18          |   foo(); bar(); foo();
   I        |   ^
   I        |                 ^
19          | }

This can be interpreted as follows:

The for loop on line 8 of the report:
- was vectorized
- has a vectorization factor of 4 (there are four 32-bit integer lanes)
- has an interleave count of 1 (so was not interleaved)
The for loop on line 12 was unrolled 16 times. This means it was completely unrolled, with no remaining loop.
All three instances of foo() were inlined.
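To make the V4,1 annotation more concrete, the sketch below models the vectorized loop in plain scalar C: each outer iteration handles a block of four elements, standing in for one pass of vector instructions over four 32-bit lanes. This is only a conceptual illustration, not the code that armclang generates. The function names loop_scalar and loop_blocked4 are hypothetical, and the trip count is passed as a parameter n instead of the fixed 1600 used in or.c.

```c
/* The loop as written in or.c, with the trip count passed as n. */
void loop_scalar(int *res, const int *d, const int *p, int n) {
  for (int i = 0; i < n; i++) {
    res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }
}

/* Conceptual model of "V4,1": vectorization factor 4, interleave count 1. */
void loop_blocked4(int *res, const int *d, const int *p, int n) {
  int i = 0;
  /* Each outer iteration stands in for one pass of vector instructions. */
  for (; i + 4 <= n; i += 4) {
    for (int lane = 0; lane < 4; lane++) {
      int j = i + lane;
      res[j] = (p[j] == 0) ? res[j] : res[j] + d[j];
    }
  }
  /* Remainder iterations, needed only when n is not a multiple of 4. */
  for (; i < n; i++) {
    res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }
}
```

With n = 1600, as in the example, the blocked loop runs 400 outer iterations and the remainder loop does no work.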

Improved instruction scheduling for large cores

The 19.2 release added experimental scheduler improvements that can give performance benefits on large processors, such as ThunderX2. By default, the scheduler improvements are disabled. To enable them, include the "-mllvm -misched-favour-latency=true" option at compile time. We welcome feedback on your experience with this option, which might become the default scheduler in future releases.

Use of libamath and libastring in Arm Compiler for Linux

Libamath and libastring provide efficient scalar and vector versions of some common math and memory/string library functions, for both C/C++ and Fortran workloads.

By default, libamath/libastring scalar routines are used rather than those provided by your operating system libraries. We believe the libamath versions should be more efficient in all cases.

libamath vector routines are used by the compiler during auto-vectorization. When the compiler encounters a scalar math routine (or Fortran math intrinsic) in a loop, it can replace this call with the vector version, allowing auto-vectorization to proceed. This functionality can be enabled by adding the "-armpl" option.

See the libraries section below for more details of the routines provided by libamath and libastring.

Fortran support

The Fortran 2008 ERROR STOP statement is now supported.

A new flag -fno-realloc-lhs has been added, for consistency with GNU compilers. Use -fno-realloc-lhs in place of -Mallocatable=95, which is no longer documented, but is still supported. Refer to the Fortran Reference Guide for information about this flag.

License file environment variable

The environment variable ARM_LICENSE_DIR can now be used to set the license search path. This environment variable is consistent with Arm Forge. The old environment variable (ARM_HPC_COMPILER_LICENSE_SEARCH_PATH) is still supported but is not recommended.

Libraries

Since the last blog post, we have been working on a variety of performance improvements in Arm Performance Libraries. The most significant gains should be seen in our FFT routines, especially for single-precision data and for power-of-2 transform lengths in either single or double precision. Below, we highlight some of the other major changes made in the 19.2 release.

Half precision support

In the 19.2 release we have introduced half precision library routines for the first time. There is a complete set of half precision FFT routines in C, matching the functionality available in single and double precision, and we also have a half precision matrix-matrix multiplication routine which matches the standard L3 BLAS *GEMM interfaces. The latter may be accessed in C via the hgemm_() function.

A call to hgemm_ uses the __fp16 half-precision type:

            __fp16 *A, *B, *C;
            __fp16 alpha, beta;
            …
            hgemm_(&transa, &transb, &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc);
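Because hgemm_ follows the standard L3 BLAS *GEMM interface, the operation it performs is C = alpha * op(A) * op(B) + beta * C on column-major matrices. The sketch below is an unoptimized reference for the no-transpose case only, written in float so that it compiles on any platform. It is meant to illustrate what the arguments mean and is not how Arm PL implements hgemm_; the function name ref_gemm_nn is hypothetical.

```c
/* Reference GEMM, no-transpose case: C = alpha * A * B + beta * C.
 * Matrices are column-major, as in BLAS: element (i, j) of A is A[i + j * lda].
 * A is m x k, B is k x n, C is m x n. */
void ref_gemm_nn(int m, int n, int k, float alpha, const float *A, int lda,
                 const float *B, int ldb, float beta, float *C, int ldc) {
  for (int j = 0; j < n; j++) {
    for (int i = 0; i < m; i++) {
      float sum = 0.0f;
      for (int l = 0; l < k; l++) {
        sum += A[i + l * lda] * B[l + j * ldb];
      }
      /* As in BLAS, the input value of C is not read when beta is zero. */
      float scaled_c = (beta == 0.0f) ? 0.0f : beta * C[i + j * ldc];
      C[i + j * ldc] = alpha * sum + scaled_c;
    }
  }
}
```

The leading dimensions lda, ldb and ldc are the strides between consecutive columns, which is why they are passed separately from m, n and k.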

The naming scheme for the FFTW interfaces has been extended so that all half-precision routines are prefixed fftwh_. For example:

	/* Include Arm Performance Libraries FFT interface. Make sure you include the
	    header file provided by Arm PL and not the header provided by FFTW3.*/
	#include "fftw3.h"
 
	/* Declare half-precision arrays to be used */
	__fp16 *in;
	fftwh_complex *out;
	fftwh_plan plan;
 
	/* Plan, execute and destroy */
	plan = fftwh_plan_many_dft_r2c(...);
	fftwh_execute(plan);
	fftwh_destroy_plan(plan);

These functions are available for all CPU targets. On machines where native half precision instructions are available (in other words, for Armv8.2 capable CPUs and above) those instructions will be utilized in order to provide a performance improvement. On machines where native half precision instructions are not available, computation will be performed in single precision; the library will take care of converting the input and output data as necessary.

Improved libamath vector functions

We have added new implementations of a number of vector math functions in libamath, which provide significant performance improvements. The following results illustrate the improvements observed on a Marvell ThunderX2 system when running the Elefunt benchmark. The time taken has been normalized to the result achieved using GCC 8.2 with the standard libm, so a value greater than 1 indicates faster performance than with GCC.

[Figure: Arm PL 19.2 libamath performance improvements]
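One reading of this normalization that is consistent with "greater than 1 means faster" is a simple ratio of run times. In the sketch below, the function name is hypothetical, and the timings in the usage note are made-up values, not measurements from this post.

```c
/* Normalized performance relative to the GCC 8.2 + libm baseline.
 * A value greater than 1 means the libamath build ran faster. */
double normalized_performance(double baseline_seconds, double libamath_seconds) {
  return baseline_seconds / libamath_seconds;
}
```

For instance, hypothetical timings of 3.0 seconds for the baseline and 1.5 seconds with libamath give a normalized value of 2.0.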

libastring: A new library of string functions

libastring is a new library in version 19.2 that contains optimized versions of some of the string functions found in glibc. The performance improvements for memcpy and memset, compared with the same functions in the default glibc library, are shown below:

[Figure: Arm PL 19.2 libastring]

As with libamath, the libastring library is linked into your application automatically by the Arm Compiler, so you only need to re-link your application to benefit from the performance improvements provided by this library.

New Porting and Tuning Guides

Our technical authors have put together two comprehensive porting guides to assist developers migrating to Arm. These guides are readily available online and include great tips and tricks to extract the last drop of performance from your Arm hardware.

Support

If you have questions or doubts, or want to raise an issue, either email HPC software support or visit the support page. The vast majority of requests are answered within a single working day. The HPC ecosystem pages also have valuable information to get you started on Arm.

Conclusion

I am excited to announce the availability of Arm Allinea Studio 19.2, with major enhancements to the compiler and libraries. Please get in touch to request a trial or buy a license. We plan to provide the next major release, 19.3, towards the end of August 2019, with more features and improvements.
