Arm Allinea Studio 19.3 is now available.
This release adds major new features for our customers to test and benchmark, ready for presentations and workshops at SC in November. Highlights of this release include:
Arm Performance Libraries now contains libraries featuring Scalable Vector Extension (SVE) instructions. The SVE-enabled version has not been tuned for any particular microarchitecture; it is provided so that you can experiment with SVE in emulation, ahead of silicon deployments.
We recommend using the Arm Instruction Emulator (ArmIE) to execute programs containing SVE instructions.
To link against the SVE libraries, simply specify -armpl=sve with the Arm compiler. For example, to link an executable:
armflang -armpl=sve -lm driver.o -o driver.exe
And to run the executable using armie, emulating a core with 512-bit vector units:
armie -msve-vector-bits=512 ./driver.exe
A full set of examples is provided in the release.
The 19.3 release introduces a new set of sparse routines for Sparse Matrix-Matrix multiplication (SpMM), complementing our existing routines for Sparse Matrix-Vector multiplication (SpMV). The new functionality is available in both C and Fortran, along with examples and full documentation.
These routines are not yet optimized for best performance: work for this release has concentrated on designing the interfaces and providing a functionally correct implementation. Optimizations are due to be included in future releases.
As with SpMV, we support matrices provided in Compressed Sparse Row (CSR), Compressed Sparse Column (CSC) and Coordinate list (COO) formats. In addition, users can create sparse matrix handles for dense matrices. This allows the multiplication of a sparse matrix by a dense matrix (or even of two dense matrices, in which case we call the appropriate BLAS *GEMM routine, as expected).
The operation that is performed by our SpMM is the same as the dense *GEMM equivalent:
C := α op(A) op(B) + β C
where op() denotes an optional transpose operation, and α and β are scalars.
For convenience, we also provide a non-destructive SpADD operation (that is, the operands are not overwritten):
C := α op(A) + β op(B)
We allow the creation of two special matrices, the null matrix (all zeros) and the multiplicative identity matrix (a unit diagonal and zeros elsewhere), so that the trivial forms of these operations can be optimized.
The API follows the same workflow as our SpMV functionality: create, hint, optimize, execute, destroy. For SpMM we create handles for the three matrices, for example:
armpl_status_t info = armpl_spmat_create_csr_d(&armpl_mat_a, M, K, row_ptr_a, col_indx_a, vals_a, creation_flags); // Similarly for B and C
We then optionally provide hints about the structure and usage of each matrix:
info = armpl_spmat_hint(armpl_mat_a, ARMPL_SPARSE_HINT_STRUCTURE, ARMPL_SPARSE_STRUCTURE_UNSTRUCTURED); // Similarly for B and C
Next, we optimize the SpMM operation, during which new optimized data structures may be created:
info = armpl_spmm_optimize(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_a, armpl_mat_b, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_c);
Execution then populates matrix C with the result:
info = armpl_spmm_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, alpha, armpl_mat_a, armpl_mat_b, beta, armpl_mat_c);
Since matrix C is hidden behind the opaque handle armpl_mat_c, we must export the result back into CSR form:
info = armpl_spmat_export_csr_d(armpl_mat_c, 0, &nrows_c, &ncols_c, &out_row_ptr_c, &out_col_indx_c, &out_vals_c);
Finally, we clean up by destroying the matrix handles:
info = armpl_spmat_destroy(armpl_mat_a); // Similarly for B and C
The performance of Sparse Matrix-Vector multiplication (SpMV) has been significantly improved for some problems when run in parallel. For example, the following selection of matrices from the Florida sparse matrix collection shows up to a four-fold performance improvement compared with the 19.2 Arm Performance Libraries release on a ThunderX2.
We have improved the performance of our Fast Fourier Transform routines, particularly for small transform lengths (n<20) and transform lengths with prime factors. Our performance now compares favorably with FFTW, particularly for complex-to-complex transforms, as illustrated in the following graph for the single-precision case on a ThunderX2: most of the results fall above the bold line y=1, indicating faster performance with Arm PL 19.3 than with FFTW 3.3.8 (FFTW was compiled with GCC 8.2 and configured with --enable-neon --enable-fma).
We have also enabled shared-memory parallelism for multi-dimensional problems, attaining 80% parallel efficiency for a 3-d problem of size 70x70x70 on a ThunderX2:
If you have questions or want to raise an issue, either email HPC software support or visit the support page. Most requests are answered within a single working day. The HPC ecosystem pages also have valuable information to get you started on Arm.
I am excited to announce the availability of Arm Allinea Studio 19.3, with major enhancements to the compiler and libraries. We plan to provide the next major release, 20.0, towards the end of November 2019, with more features and improvements.
[CTAToken URL = "https://pages.arm.com/Hpc-trial-request.html" target="_blank" text="Request a trial" class ="green"][CTAToken URL = "https://store.developer.arm.com/store/high-performance-computing-hpc-tools/arm-allinea-studio" target="_blank" text="Buy a license" class ="green"]