Arm Allinea Studio 19.0 now available with improved compiler and libraries

Arm Allinea Studio 19.0 is now available, with an updated Arm Performance Libraries (ArmPL) and improved compilers. In this blog, I have captured the highlights of the release. ArmPL adds sparse matrix-vector multiplication routines, support for the FFTW guru and MPI interfaces, and parallel-scaling improvements to key routines. Compiler improvements include tighter integration between ArmPL and the compiler, new Fortran directives to control vectorization, and performance improvements to Fortran math intrinsics.

Major update to Arm Performance Libraries

Sparse routine support

Support for sparse matrix-vector multiplication (SpMV) has been added to Arm Performance Libraries at 19.0. Our interface follows the inspector-executor model: users provide their input matrix in a commonly used format, such as Compressed Sparse Row (CSR), to a "create" function, which returns an opaque handle of type armpl_spmat_t that identifies the matrix. After creation, users may supply hints about how the matrix will be used: for example, whether it will be applied in transpose or conjugate-transpose form, whether the library may allocate memory internally, and how many times the same matrix will be used in SpMV executions. An optional optimization call then uses these hints to tune the library's internal data structures. If the library is permitted to allocate memory, new data structures may be created (and the original ones freed) to provide faster SpMV execution. We have also provided a function that lets users update the values of the non-zero elements in the matrix. Our interface supports the usual data types, single and double precision real and complex, and the execution functions are parallelized with OpenMP.

At 19.0 these interfaces are provided in C, along with an example of usage and comprehensive documentation. The CSR input format is supported at 19.0, together with an optimized execution path for double precision real data when many SpMV executions use the same matrix. Support for the CSC and COO formats, and for Fortran, will appear in 19.1.
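
To illustrate the flow, here is a minimal sketch in C. It follows the armpl_spmat_* function and hint names used in the ArmPL documentation, but treat it as illustrative rather than definitive, and check the example and documentation shipped with the library for the exact signatures:

    #include <armpl.h>

    int main(void) {
        /* 4x4 sparse matrix with 6 non-zeros, in CSR format */
        armpl_int_t m = 4, n = 4;
        armpl_int_t row_ptr[] = {0, 2, 3, 4, 6};
        armpl_int_t col_idx[] = {0, 1, 1, 2, 0, 3};
        double vals[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        double x[] = {1.0, 1.0, 1.0, 1.0};
        double y[4];

        /* Create: pass the CSR arrays to the library and get an opaque handle back.
           The final argument (0 here) leaves memory management to the library. */
        armpl_spmat_t A;
        armpl_spmat_create_csr_d(&A, m, n, row_ptr, col_idx, vals, 0);

        /* Hints: the matrix will not be transposed, and will be reused many times */
        armpl_spmat_hint(A, ARMPL_SPARSE_HINT_SPMV_OPERATION, ARMPL_SPARSE_OPERATION_NOTRANS);
        armpl_spmat_hint(A, ARMPL_SPARSE_HINT_SPMV_INVOCATIONS, ARMPL_SPARSE_INVOCATIONS_MANY);

        /* Optimize: let the library rebuild its internal data structures */
        armpl_spmv_optimize(A);

        /* Execute: y = 1.0*A*x + 0.0*y */
        armpl_spmv_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, 1.0, A, x, 0.0, y);

        armpl_spmat_destroy(A);
        return 0;
    }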

FFTW interface updates

Arm Performance Libraries now provides support for FFTW's guru and MPI interfaces, completing support for the DFT functionality in FFTW. The basic, advanced and guru functions all benefit from the same optimized NEON FFT kernels within the library, with OpenMP parallelization in some cases.
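
For reference, the guru interface describes transforms with fftw_iodim descriptors, which give a length and input/output strides per dimension. The fragment below plans a simple in-place 1D transform through the guru entry point; this is standard FFTW usage, here compiled and linked against Arm Performance Libraries rather than FFTW (the problem size is arbitrary):

    #include <fftw3.h>

    int main(void) {
        int n = 1024;
        fftw_complex *data = fftw_alloc_complex(n);
        for (int i = 0; i < n; i++) { data[i][0] = 1.0; data[i][1] = 0.0; }

        /* One transform dimension of length n, unit stride on input and output */
        fftw_iodim dim = { .n = n, .is = 1, .os = 1 };

        /* rank 1, no "howmany" loop dimensions: a single in-place 1D DFT */
        fftw_plan p = fftw_plan_guru_dft(1, &dim, 0, NULL, data, data,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(data);
        return 0;
    }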

Support for the FFTW MPI C interface at 19.0 is provided for compatibility; performance enhancements and Fortran support are scheduled for later releases. A copy of the fftw3-mpi.h header file is available in our include directories, which means that the instructions for compiling and linking code containing calls to FFTW MPI functions are the same as for any other function in the library, except that you should use the MPI compiler wrapper scripts (e.g. mpicc) provided as part of your MPI installation. Note that users must include the ArmPL version of fftw3-mpi.h rather than the one shipped with FFTW.
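
As a sketch, a minimal distributed 2D transform looks identical to the equivalent FFTW code; only the header location and the link line change (the problem size here is arbitrary):

    #include <fftw3-mpi.h>   /* be sure to pick up the ArmPL copy of this header */

    int main(int argc, char **argv) {
        const ptrdiff_t N0 = 256, N1 = 256;
        ptrdiff_t local_n0, local_0_start;

        MPI_Init(&argc, &argv);
        fftw_mpi_init();

        /* Ask the library how much of the 2D array this rank owns */
        ptrdiff_t alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                                       &local_n0, &local_0_start);
        fftw_complex *data = fftw_alloc_complex(alloc_local);

        fftw_plan p = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                           FFTW_FORWARD, FFTW_ESTIMATE);
        /* ... initialize this rank's local slab of rows, then ... */
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(data);
        MPI_Finalize();
        return 0;
    }

Compile with the MPI wrapper from your installation, for example mpicc with the usual -armpl and -mcpu flags (assuming the wrapper invokes armclang).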

Parallel scaling improvements to key ArmPL routines

Users should see significant gains in multithreaded DGEMM (double precision real general matrix-matrix multiplication) performance. The improvements over version 18.4.2 are illustrated below for increasing square problem sizes. Arm Performance Libraries can now attain over 90% of the theoretical peak performance of a dual-socket Cavium ThunderX2 system, using 56 threads in this case. A particularly important feature of this graph is that DGEMM now reaches its peak performance much sooner, running at 90% of peak for relatively modest problem sizes.
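
No code changes are needed to pick up these improvements: the same DGEMM call uses the multithreaded library when you link against the parallel variant, and the thread count is controlled with OMP_NUM_THREADS as usual. A minimal sketch using the CBLAS interface follows; it assumes armpl.h declares the CBLAS prototypes, and the problem size is chosen arbitrarily:

    #include <stdlib.h>
    #include <armpl.h>

    int main(void) {
        const int n = 2000;   /* square problem size, chosen arbitrarily */
        double *A = malloc(n * n * sizeof *A);
        double *B = malloc(n * n * sizeof *B);
        double *C = malloc(n * n * sizeof *C);
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* C = 1.0*A*B + 0.0*C; the threading happens inside the library */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        free(A); free(B); free(C);
        return 0;
    }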

Parallel scalability tuning has also been carried out for a number of key LAPACK routines, improving load balance for xPOTRF, xGEQRF and xGETRF (Cholesky, QR and LU factorization, respectively) on Cavium ThunderX2 systems when using many threads. The bar chart below shows the speedups gained in 19.0 compared with 18.4.2 for the real Cholesky factorization routines SPOTRF and DPOTRF. A bar height of 1 indicates no improvement; a height greater than 1 shows the speedup factor achieved in 19.0.
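
These routines are called exactly as with any other LAPACK implementation. For example, here is a Cholesky factorization via DPOTRF, declaring the Fortran symbol directly from C; this sketch assumes the common trailing-underscore name-mangling convention and the default 32-bit lp64 integers:

    #include <stdio.h>

    /* Fortran LAPACK symbol: Cholesky factorization of a symmetric
       positive definite matrix, stored column-major */
    extern void dpotrf_(const char *uplo, const int *n, double *a,
                        const int *lda, int *info);

    int main(void) {
        int n = 3, info;
        double a[9] = {4.0, 2.0, 2.0,
                       2.0, 5.0, 1.0,
                       2.0, 1.0, 6.0};
        dpotrf_("L", &n, a, &n, &info);      /* factorize the lower triangle */
        printf("dpotrf info = %d\n", info);  /* 0 indicates success */
        return 0;
    }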

Fortran Compiler enhancements

New Fortran directives to control vectorization

NOVECTOR directive

The Fortran NOVECTOR directive allows you to disable auto-vectorization for individual loops. Refer to the Directives chapter of the Arm Fortran Compiler Language Reference guide for further information and an example.

VECTOR ALWAYS directive

The Fortran VECTOR ALWAYS directive allows you to request that a loop be auto-vectorized, irrespective of the compiler's internal cost model, provided it is safe to do so. Refer to the Directives chapter of the Arm Fortran Compiler Language Reference guide for further information and an example.

Improved Fortran language support

The 19.0 Fortran Compiler has better support for the Fortran 2003 and 2008 language standards. Specifically:

  • Default support for Fortran 2003 allocatable array semantics. Previously the default behavior was Fortran 95 semantics, and "-Mallocatable=03" had to be set to enable Fortran 2003 semantics, which allow dynamic allocation and reallocation of the left-hand side of allocatable array assignments.
  • Support for SUBMODULEs.
  • Partial support for DO CONCURRENT.

Enhanced integration between Arm Performance Libraries and Arm Compiler

ArmPL provides optimized standard core math libraries for high-performance computing applications on Arm processors. Arm Compiler for HPC 19.0 introduces a new compiler flag, -armpl, which makes these libraries significantly easier to use.

In order to use ArmPL, you need to make three decisions:

  1. Which CPU should the code be optimized for?
  2. How large should integers be?
  3. Do you want to use OpenMP thread parallelism?

We'll discuss these decisions in turn.

Decision 1: Which CPU should the code be optimized for?

ArmPL provides tuned implementations for each supported CPU architecture, plus a generic implementation that will execute on any supported Armv8-A computer.

When using the -armpl flag, you must indicate your choice of implementation using the -mcpu=<arg> flag, with one of the following options:

  • generic: generates portable output suitable for any Armv8-A computer.
  • native: auto-detects the CPU of the computer you are building on.
  • cortex-a72: optimizes for Arm Cortex-A72 based computers.
  • thunderx2t99: optimizes for Cavium ThunderX2® based computers.


For most HPC users, performance matters more than code portability, and the system you compile on has the same CPU as the system you run on. If that is the case for you, we recommend -mcpu=native.

This decision affects both the compiler-generated code and which Arm Performance Libraries variant is linked.

Decision 2: How large should integers be?

ArmPL provides two interfaces:

  • lp64: integers are 32-bit (known as lp64 because 'long' and 'pointer' types are 64-bit). This is the default.
  • ilp64: integers are 64-bit (known as ilp64 because 'integer', 'long' and 'pointer' types are all 64-bit).

For many users, 32-bit integers are sufficient, and they allow better use of caches and memory bandwidth.

Fortran users who specify the -i8 flag (which promotes the Fortran INTEGER type to 64 bits) will almost certainly want the ilp64 interface. Therefore, the presence of the -i8 flag changes the default ArmPL interface to ilp64.

To explicitly define the ArmPL interface, specify it as an argument to the -armpl flag, for example, -armpl=ilp64.
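
If you call ArmPL from C, the integer arguments to library routines follow the selected interface too. One way to write calls that compile correctly under either interface is to use the armpl_int_t type from armpl.h, which is 32-bit under lp64 and 64-bit under ilp64. A hypothetical fragment, assuming the BLAS prototypes declared in armpl.h:

    #include <armpl.h>

    /* x = alpha*x via the BLAS routine DSCAL; armpl_int_t matches
       whichever of lp64/ilp64 was selected at compile time */
    void scale_vector(armpl_int_t n, double alpha, double *x) {
        armpl_int_t incx = 1;
        dscal_(&n, &alpha, x, &incx);
    }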

Decision 3: Do you want to use OpenMP thread parallelism?

ArmPL provides two variants:

  • sequential: the single-threaded implementation of Arm Performance Libraries.
  • parallel: the OpenMP multithreaded implementation of Arm Performance Libraries.


If you are compiling your own code with OpenMP (using the -fopenmp flag), parallel is the default; otherwise, sequential is the default.

To explicitly define the ArmPL variant, specify it as an argument to the -armpl flag, for example, -armpl=parallel.

Multiple comma-separated arguments can be supplied to -armpl, for example, -armpl=parallel,ilp64.

Building with -armpl

The -armpl and -mcpu flags enable the compiler to find appropriate ArmPL header files (during compilation) and libraries (during linking).

Note: If your build process compiles and links as two separate steps, please ensure you add the same -armpl and -mcpu options to both.
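
For example, a two-step build might look like this (the source file name is hypothetical):

    armclang -c -armpl -mcpu=native code.c -o code.o
    armclang -armpl -mcpu=native code.o -o code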

Additional benefits of the -armpl option

The -armpl option also enables a number of further optimizations:

  • Optimized versions of selected C mathematical functions declared in math.h.
  • Optimized versions of Fortran math intrinsics.
  • Auto-vectorization of calls to C mathematical functions (disable this with -fno-simdmath); see the sketch after this list.
  • Auto-vectorization of Fortran math intrinsics (disable this with -fno-simdmath).
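
As a sketch of the auto-vectorization point above, with -armpl and optimization enabled the compiler may replace the scalar cos() calls in a loop like this with calls to an optimized vector cosine routine (compile with, for example, armclang -O3 -armpl -mcpu=native):

    #include <math.h>

    void apply_cos(int n, const double *restrict in, double *restrict out) {
        /* A vectorization candidate: each iteration is independent, so the
           calls to cos() can be vectorized when a vector math library is available */
        for (int i = 0; i < n; i++)
            out[i] = cos(in[i]);
    }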

Examples

armflang -i8 -armpl -mcpu=native
  Uses the sequential, ilp64 ArmPL libraries, optimized for the CPU of the build computer.
armclang -armpl -fopenmp -mcpu=generic
  Uses the parallel, lp64 ArmPL libraries, with portable output suitable for any Armv8-A computer.
armclang -armpl=parallel,ilp64 -mcpu=cortex-a72
  Uses the parallel, ilp64 ArmPL libraries, optimized for Cortex-A72 based computers.

Further examples of how to use the -armpl flag can be found in the Arm Performance Libraries Getting Started Guide.

New porting and tuning guides

Guides for building MVAPICH2 2.3 and OpenUCX are now available on developer.arm.com.

MVAPICH2 is a performant, scalable and fault-tolerant implementation of the MPI 3.1 standard for high-end computing systems. It is able to exploit a wide range of networking technologies, including InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE.

OpenUCX is a collaboration between industry, national laboratories, and academia to create an open-source communication framework for data-centric and high-performance applications. The new guide on developer.arm.com details how to build OpenUCX with the Arm HPC Compiler, and how to configure Open MPI to adopt the OpenUCX transport layer. More information about both projects can be found on the Open MPI and OpenUCX web pages, respectively.

GNU8 Compiler

Arm Allinea Studio ships with the GNU 8.2 compiler, with support for C/C++ and Fortran applications. GNU 8 brings many performance improvements for AArch64, including SVE support, in addition to generic improvements over GNU 7. The GNU 8 change log page provides detailed information on the changes.

Support

If you have questions or want to raise an issue, either email HPC software support or visit the support page. The vast majority of requests are answered within one working day. The revamped HPC ecosystem pages also have valuable information to get you started on Arm.

Conclusion

We are excited to announce the availability of Arm Allinea Studio 19.0, with major enhancements to the compilers and libraries. Please get in touch to request a trial or buy a license. We plan to deliver the next major release, 19.1, in February 2019, with more features and improvements.

Download Arm Allinea Studio 19.0
