Arm Allinea Studio 19.3: performance improvements in preparation for SC'19 benchmarks

Patrick Wohlschlegel
September 26, 2019
4 minute read time.

Arm Allinea Studio 19.3 is now available.

This release adds major new features for our customers to test and benchmark, ready for presentations and workshops at SC in November. Highlights of this release include:

  • SVE-enabled performance libraries, allowing customers to start preparing for SVE-enabled hardware and to test their hardware models.
  • Support for RHEL 8 and SLES 15 operating systems.
  • An offline HTML copy of the user documentation, now packaged with the product, allowing users to read the documentation without needing Internet access from behind a firewall.
  • Improvements to auto-vectorization, including Neon reduction loops, Fortran loops with calls to math routines, and C/C++ routines with calls to sincos.
  • Field quality improvements, with numerous bug fixes implemented.

Introduction of SVE library

Arm Performance Libraries now contains libraries featuring Scalable Vector Extension (SVE) instructions. The SVE-enabled version has not been tuned for any particular microarchitecture; it is provided so that users can experiment with SVE in an emulated mode ahead of silicon deployments.

We recommend using the Arm Instruction Emulator (ArmIE) to execute programs containing SVE instructions.

To link against the SVE libraries, simply specify -armpl=sve with the Arm compiler. For example, to link an executable:

armflang -armpl=sve -lm driver.o -o driver.exe

And to run the executable using armie, emulating a core with 512-bit vector units:

armie -msve-vector-bits=512 ./driver.exe
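
For reference, the driver.o object assumed in these commands could be built from a C source file along the following lines. This is a sketch rather than one of the packaged examples: the file name, loop contents and the use of armclang (which also accepts -armpl=sve) are illustrative.

/* driver.c - a minimal vectorizable workload (illustrative only).
 * Compile with:         armclang -O3 -armpl=sve -c driver.c
 * Link with:            armflang -armpl=sve -lm driver.o -o driver.exe
 * Run under emulation:  armie -msve-vector-bits=512 ./driver.exe
 */
#include <math.h>
#include <stdio.h>

#define N 1024

int main(void) {
    double x[N], y[N];
    for (int i = 0; i < N; i++)
        x[i] = (double)i / N;

    /* A simple loop the compiler may auto-vectorize using the
     * SVE-enabled math routines linked in via -armpl=sve. */
    for (int i = 0; i < N; i++)
        y[i] = sin(x[i]) + cos(x[i]);

    printf("y[0] = %f, y[N-1] = %f\n", y[0], y[N - 1]);
    return 0;
}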

A full set of examples is provided in the release.

New support for Sparse Matrix-Matrix multiplication (SpMM)

The 19.3 release sees the introduction of a new set of sparse routines for Sparse Matrix-Matrix multiplication which complements our existing routines for Sparse Matrix-Vector multiplication (SpMV). The new functionality is available in both C and Fortran along with examples and full documentation.

These routines are yet to be optimized to provide the best performance. Work for this release has concentrated on designing the interfaces and providing a functionally correct implementation. Optimizations are due to be included in future releases.

As with SpMV, we support matrices provided in Compressed Sparse Row (CSR), Compressed Sparse Column (CSC) and Coordinate list (COO) formats. In addition, we also allow users to create sparse matrix handles for dense matrices. This allows for the multiplication of a sparse matrix by a dense matrix (or even the multiplication of two dense matrices, in which case we call the appropriate BLAS *GEMM routine, as expected).
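
For example, a 3x3 matrix with five non-zero entries could be described in CSR form by three arrays, as in the following sketch (the values are illustrative; armpl_int_t is the integer type used by the sparse interface, declared in armpl.h):

#include <armpl.h>

/* CSR layout of the 3x3 matrix   | 1 0 2 |
 *                                | 0 3 0 |
 *                                | 4 0 5 |
 * row_ptr[i]..row_ptr[i+1] delimits the slice of col_indx/vals
 * that stores row i.
 */
armpl_int_t row_ptr[]  = {0, 2, 3, 5};
armpl_int_t col_indx[] = {0, 2, 1, 0, 2};
double      vals[]     = {1.0, 2.0, 3.0, 4.0, 5.0};

These are the arrays that would be passed to armpl_spmat_create_csr_d, as shown in the workflow below.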

The operation that is performed by our SpMM is the same as the dense *GEMM equivalent:

C := α op(A) op(B) + β C

where op indicates an optional transpose operation, and α and β are scalars.

For convenience, we also provide a non-destructive SpADD operation (i.e. where the add operands are not overwritten):

C := α op(A) + β op(B)

We allow the creation of two special matrices: the null matrix (of all zeros) and the multiplicative identity matrix (a unit diagonal and zeros elsewhere) to optimize the trivial forms of these operations.

The API follows the same workflow as our SpMV functionality: create, hint, optimize, execute, destroy. For SpMM we create handles for the three matrices, for example:

armpl_status_t info = armpl_spmat_create_csr_d(&armpl_mat_a, M, K, row_ptr_a, col_indx_a, vals_a, creation_flags); // Similarly for B and C

We will then optionally provide hints about the structure and usage of each matrix:

info = armpl_spmat_hint(armpl_mat_a, ARMPL_SPARSE_HINT_STRUCTURE, ARMPL_SPARSE_STRUCTURE_UNSTRUCTURED); // Similarly for B and C

These hints are supplied before optimizing the SpMM operation, during which new, optimized data structures may be created:

info = armpl_spmm_optimize(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_a, armpl_mat_b, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_c);

Execution then populates matrix C with the result:

info = armpl_spmm_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, alpha, armpl_mat_a, armpl_mat_b, beta, armpl_mat_c);

Since matrix C is hidden behind the opaque handle armpl_mat_c, we must export the result back into CSR form:

info = armpl_spmat_export_csr_d(armpl_mat_c, 0, &nrows_c, &ncols_c, &out_row_ptr_c, &out_col_indx_c, &out_vals_c);

Finally, we clean up by destroying the matrix handles:

info = armpl_spmat_destroy(armpl_mat_a); // Similarly for B and C
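
Putting the steps together, a minimal end-to-end sketch might look like the following. The matrix sizes, values, creation flag and error-handling pattern are illustrative assumptions rather than material from the packaged examples; consult those examples for the exact flag values and build options.

/* spmm_example.c - sketch of the SpMM workflow: create, hint, optimize,
 * execute, export, destroy. Computes C := 1.0*A*B + 1.0*C for small CSR
 * matrices. Build (one possibility): armclang -armpl spmm_example.c -o spmm
 */
#include <stdio.h>
#include <stdlib.h>
#include <armpl.h>

#define CHECK(call) \
    do { if ((call) != ARMPL_STATUS_SUCCESS) { fprintf(stderr, "ArmPL call failed\n"); exit(1); } } while (0)

int main(void) {
    /* A (2x3), B (3x2) and C (2x2) in CSR form, illustrative values. */
    armpl_int_t row_ptr_a[] = {0, 2, 3}, col_indx_a[] = {0, 2, 1};
    double vals_a[] = {1.0, 2.0, 3.0};
    armpl_int_t row_ptr_b[] = {0, 1, 2, 3}, col_indx_b[] = {0, 1, 0};
    double vals_b[] = {4.0, 5.0, 6.0};
    armpl_int_t row_ptr_c[] = {0, 1, 2}, col_indx_c[] = {0, 1};
    double vals_c[] = {7.0, 8.0};

    armpl_spmat_t armpl_mat_a, armpl_mat_b, armpl_mat_c;
    armpl_int_t creation_flags = 0;   /* default creation behaviour (assumption) */

    /* Create handles for the three matrices. */
    CHECK(armpl_spmat_create_csr_d(&armpl_mat_a, 2, 3, row_ptr_a, col_indx_a, vals_a, creation_flags));
    CHECK(armpl_spmat_create_csr_d(&armpl_mat_b, 3, 2, row_ptr_b, col_indx_b, vals_b, creation_flags));
    CHECK(armpl_spmat_create_csr_d(&armpl_mat_c, 2, 2, row_ptr_c, col_indx_c, vals_c, creation_flags));

    /* Optionally provide structure hints. */
    CHECK(armpl_spmat_hint(armpl_mat_a, ARMPL_SPARSE_HINT_STRUCTURE, ARMPL_SPARSE_STRUCTURE_UNSTRUCTURED));
    CHECK(armpl_spmat_hint(armpl_mat_b, ARMPL_SPARSE_HINT_STRUCTURE, ARMPL_SPARSE_STRUCTURE_UNSTRUCTURED));

    /* Optimize, then execute C := 1.0*A*B + 1.0*C. */
    CHECK(armpl_spmm_optimize(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS,
                              ARMPL_SPARSE_SCALAR_ONE, armpl_mat_a, armpl_mat_b,
                              ARMPL_SPARSE_SCALAR_ONE, armpl_mat_c));
    CHECK(armpl_spmm_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS,
                            1.0, armpl_mat_a, armpl_mat_b, 1.0, armpl_mat_c));

    /* Export the result from the opaque handle back into CSR arrays. */
    armpl_int_t nrows_c, ncols_c, *out_row_ptr_c, *out_col_indx_c;
    double *out_vals_c;
    CHECK(armpl_spmat_export_csr_d(armpl_mat_c, 0, &nrows_c, &ncols_c,
                                   &out_row_ptr_c, &out_col_indx_c, &out_vals_c));
    printf("C is %d x %d with %d stored values\n",
           (int)nrows_c, (int)ncols_c, (int)out_row_ptr_c[nrows_c]);

    /* Clean up. */
    CHECK(armpl_spmat_destroy(armpl_mat_a));
    CHECK(armpl_spmat_destroy(armpl_mat_b));
    CHECK(armpl_spmat_destroy(armpl_mat_c));
    return 0;
}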

SpMV parallel performance improvements

The performance of Sparse Matrix-Vector multiplication (SpMV) has been significantly improved for some problems when run in parallel. For example, the following selection of matrices from the Florida sparse matrix collection shows up to a fourfold performance improvement compared with the 19.2 Arm Performance Libraries release on a ThunderX2.

FFT performance improvements

We have improved the performance of our Fast Fourier Transform routines, particularly for small transform lengths (n<20) and transform lengths with prime factors. Our performance now compares favorably with FFTW, particularly for complex-to-complex transforms, as illustrated in the following graph for the single-precision case on a ThunderX2: most of the results fall above the bold line y=1, indicating faster performance with Arm PL 19.3 than with FFTW 3.3.8 (FFTW was compiled with GCC 8.2 and configured with --enable-neon --enable-fma).

Graph: Arm Performance Libraries 19.3 speedup over FFTW 3.3.8 for single-precision complex-to-complex transforms on ThunderX2.
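
If you call the FFT routines through the FFTW-compatible interface that Arm Performance Libraries provides, existing FFTW call sites can simply be relinked against Arm PL. Below is a minimal single-precision complex-to-complex sketch; the transform length, input data and build command are illustrative.

/* fft_example.c - a small single-precision complex-to-complex transform
 * using the FFTW-compatible interface. Build (one possibility):
 *   armclang -armpl fft_example.c -o fft_example
 */
#include <stdio.h>
#include <fftw3.h>

int main(void) {
    enum { N = 12 };            /* a small length with several prime factors */
    fftwf_complex in[N], out[N];

    for (int i = 0; i < N; i++) {
        in[i][0] = (float)i;    /* real part */
        in[i][1] = 0.0f;        /* imaginary part */
    }

    /* The usual FFTW workflow: plan, execute, destroy. */
    fftwf_plan p = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(p);
    fftwf_destroy_plan(p);

    printf("out[0] = (%f, %f)\n", out[0][0], out[0][1]);
    return 0;
}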

We have also enabled shared memory parallelism for multi-dimensional problems. We attain 80% parallel efficiency for a 3-d problem of dimensions 70x70x70 on a ThunderX2:

Graph: parallel efficiency of the 70x70x70 3-d transform on ThunderX2.

Support

If you have questions or want to raise an issue, either email HPC software support or visit the support page. Most requests are answered within one working day. The HPC ecosystem pages also have valuable information to get you started on Arm.

Conclusion

I am excited to announce the availability of Arm Allinea Studio 19.3, with major enhancements to the compilers and libraries. We plan to deliver the next major release, 20.0, towards the end of November 2019, with more features and improvements.

Request a trial
Buy a license
