Arm Allinea Studio 19.0 is now available, with an updated Arm Performance Libraries (ArmPL) and improved compilers. In this blog, I have captured the highlights of the release. ArmPL has improvements to sparse routine support, support for the FFTW guru and MPI interfaces, and parallel scaling improvements to key routines. Compiler improvements include better integration of ArmPL and the compiler, new Fortran directives to control vectorization, and performance improvements to Fortran math intrinsics.
Support for sparse matrix-vector multiplication (SpMV) has been added to Arm Performance Libraries at 19.0. Our interface follows the inspector-executor model, whereby users provide their input matrix in a commonly used format, such as Compressed Sparse Rows (CSR), to a "create" function which returns an opaque handle to an armpl_spmat_t type used to identify the matrix. After creation users may supply hints about the structure of the matrix, such as whether it will be used in transpose or conjugate-transpose form, or whether the user wants the library to allocate memory internally and how many times the same matrix will be used in an SpMV execution. These hints are then optionally used during a call to optimize internal data structures. If the library is permitted to allocate memory then new data structures may be created (and the original ones freed) in order to provide faster SpMV execution. We have also provided a function to allow users to update the values of the non-zero elements in the matrix. Our interface supports the usual data types: single and double precision real and complex, and execution functions are parallelized with OpenMP.
At 19.0 these interfaces are provided in C along with an example of usage, and comprehensive documentation. The CSR input format is supported at 19.0 along with an optimized execution option for double precision real data in the case of many SpMV executions using the same matrix. Support for CSC and COO formats and Fortran will appear in 19.1.
Arm Performance Libraries now provides support for FFTW's guru and MPI interfaces, which means that there is now complete support provided for the DFT functionality in FFTW. The basic, advanced and guru functions benefit from the same optimized NEON FFT kernels within the library, with OpenMP parallelization for some cases.
Support for the FFTW MPI C interface at 19.0 is for compatibility, with performance enhancements and Fortran support scheduled to appear in later releases. A copy of the fftw-mpi.h header file is available in our include directories, which means that the instructions for compiling and linking code containing calls to FFTW MPI functions are the same as for any other function in the library, except that you should use the MPI compiler wrapper scripts (e.g. mpicc) provided as part of your MPI installation. Note that users must include the ArmPL version of fftw-mpi.h rather than the one from FFTW.
Users should see significant gains in DGEMM (double precision real general matrix-matrix multiplication) performance when using multithreading. The improvements over version 18.4.2 are illustrated below for increasing square problem sizes. Arm Performance Libraries can now attain over 90% of the theoretical peak performance of a dual-socket Cavium ThunderX2 system, using 56 threads in this case. A particularly important feature of this graph is that DGEMM now reaches its peak performance much sooner, running at 90% of peak for relatively modest problem sizes.
Parallel performance scalability tuning has also been carried out for a number of key LAPACK routines, improving load balance for xPOTRF, xGEQR and xGETRF (Cholesky, QR and LU factorization, respectively) on Cavium ThunderX2 systems when using many threads. The bar chart below shows the speedups gained in 19.0 compared with 18.4.2 for the real Cholesky factorization routines SPOTRF and DPOTRF. A bar height of 1 indicates no improvement; greater than 1 indicates the speedup in 19.0.
The Fortran NOVECTOR directive allows you to disable auto-vectorization on individual loops. Refer to the Directives chapter of the Arm Fortran Compiler Language Reference guide for further information and an example.
The Fortran VECTOR ALWAYS directive allows you to request that a loop be auto-vectorized, irrespective of the compiler's internal cost model, if it is safe to do so. Refer to the Directives chapter of the Arm Fortran Compiler Language Reference guide for further information and an example.
The 19.0 Fortran Compiler has better support for the Fortran language standards (both Fortran 2003 and Fortran 2008).
ArmPL provides optimized standard core math libraries for high-performance computing applications on Arm processors. Arm Compiler for HPC 19.0 introduces a new compiler flag -armpl, which makes these libraries significantly easier to use.
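In order to use ArmPL, you need to make three decisions: which implementation to use (the target CPU), which interface to use (the integer width), and which variant to use (sequential or parallel).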
We'll discuss these decisions in turn.
ArmPL provides tuned implementations for each supported CPU architecture, plus a generic implementation that will execute on any supported Armv8-A computer.
When using the -armpl flag, you must indicate your choice of implementation using the -mcpu=<arg> flag. The arguments that appear in this post include native, generic and cortex-a72.
For the majority of HPC users, performance is more important than code portability, and the system on which you are compiling has the same CPU as the system on which you are executing. If so, we recommend you use -mcpu=native.
This decision affects both the compiler-generated code and the choice of Arm Performance Libraries implementation.
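ArmPL provides two interfaces, which differ in the width of their integer arguments: lp64, which uses 32-bit integers, and ilp64, which uses 64-bit integers.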
For many users, 32-bit integers are sufficient, and using them allows better use of caches and memory bandwidth.
Fortran users who specify the -i8 flag (which indicates that the Fortran INTEGER type should be promoted to 64-bits) will almost certainly want to use the ilp64 interface. Therefore, the presence of the -i8 flag changes the default ArmPL interface to ilp64.
To explicitly define the ArmPL interface, specify it as an argument to the -armpl flag, for example, -armpl=ilp64.
ArmPL provides two variants: sequential and parallel (multithreaded with OpenMP).
If you are using OpenMP for your own code (by using the -fopenmp flag), parallel is the default. If you are not using OpenMP for your own code, sequential is the default.
To explicitly define the ArmPL variant, specify it as an argument to the -armpl flag, for example, -armpl=parallel.
Multiple comma-separated arguments can be supplied to -armpl, for example, -armpl=parallel,ilp64.
The -armpl and -mcpu flags enable the compiler to find appropriate ArmPL header files (during compilation) and libraries (during linking).
Note: If your build process compiles and links as two separate steps, please ensure you add the same -armpl and -mcpu options to both.
Additional benefits of the -armpl option
The -armpl option also enables a number of further optimizations:
armflang -i8 -armpl -mcpu=native
armclang -armpl -fopenmp -mcpu=generic
armclang -armpl=parallel,ilp64 -mcpu=cortex-a72
Further examples of how to use the -armpl flag can be found in the Arm Performance Libraries Getting Started Guide.
Guides for building MVAPICH 2.3 and OpenUCX are now available on developer.arm.com.
MVAPICH2 is a performant, scalable and fault-tolerant implementation of the MPI 3.1 standard for high-end computing systems and is able to exploit a wide range of networking technologies, including InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE.
OpenUCX is a collaboration between industry, national laboratories, and academia to create an open-source communication framework for data-centric and high-performance applications. The new guide on developer.arm.com details how to build OpenUCX with the Arm HPC Compiler, and how to configure Open MPI to adopt the OpenUCX transport layer. More information about both projects can be found on the Open MPI and OpenUCX web pages, respectively.
Arm Allinea Studio ships with the GNU 8.2 compiler, with support for C/C++ and Fortran applications. GNU 8 brings many performance improvements for AArch64, including SVE support, in addition to generic improvements over GNU 7. The GNU 8 change log page provides detailed information on the changes.
If you have questions or doubts, or want to raise an issue, either email HPC software support or visit the support page. The vast majority of requests are answered within a single working day. The revamped HPC ecosystem pages also have valuable information to get you started on Arm.
We are excited to announce the availability of Arm Allinea Studio 19.0, with major enhancements to the compilers and libraries. Please get in touch to request a trial or buy a license. We plan to provide the next major release, 19.1, in February 2019 with more features and improvements.
Download Arm Allinea Studio 19.0