Arm Allinea Studio 19.3: performance improvements in preparation for SC'19 benchmarks

Patrick Wohlschlegel
September 26, 2019

Arm Allinea Studio 19.3 is now available.

This release adds major new features for our customers to test and benchmark, ready for presentations and workshops at SC in November. Highlights of this release include:

  • SVE-enabled performance libraries, allowing customers to start preparing for SVE-enabled hardware and testing their hardware models.
  • Support for RHEL 8 and SLES 15 operating systems.
  • Offline HTML copy of the user documentation now packaged with the product, allowing users to read documentation without having to jump firewalls to access the Internet.
  • Improvements to auto-vectorization, including Neon reduction loops, Fortran loops with calls to math routines, and C/C++ routines with calls to sincos.
  • Field quality improvements, with numerous bug fixes implemented.

Introduction of SVE library

Arm Performance Libraries now contains libraries featuring Scalable Vector Extension (SVE) instructions. The SVE-enabled version has not been tuned for any particular microarchitecture; it is provided so that users can experiment with SVE in an emulated mode, ahead of silicon deployments.

We recommend using Arm Instruction Emulator (ArmIE) to execute programs containing SVE instructions.

To link against the SVE libraries, specify -armpl=sve with the Arm compiler. For example, to link an executable:

armflang -armpl=sve -lm driver.o -o driver.exe

And to run the executable using armie, emulating a core with 512-bit vector units:

armie -msve-vector-bits=512 ./driver.exe

A full set of examples is provided in the release.

New support for Sparse Matrix-Matrix multiplication (SpMM)

The 19.3 release sees the introduction of a new set of sparse routines for Sparse Matrix-Matrix multiplication which complements our existing routines for Sparse Matrix-Vector multiplication (SpMV). The new functionality is available in both C and Fortran along with examples and full documentation.

These routines are yet to be optimized to provide the best performance. Work for this release has concentrated on designing the interfaces and providing a functionally correct implementation. Optimizations are due to be included in future releases.

As for SpMV, we support matrices that are provided in Compressed Sparse Row (CSR), Compressed Sparse Column (CSC) and Coordinate list (COO) formats. In addition, we also allow users to create sparse matrix handles for dense matrices. This allows for the multiplication of a sparse matrix by a dense matrix (or even the multiplication of two dense matrices, in which case we call the appropriate BLAS *GEMM routine, as expected).
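To make the CSR layout concrete, here is a minimal sketch in plain C that stores a small 3x4 matrix as the three arrays a CSR constructor expects. The array names mirror the arguments shown in the creation call later in this post, but the matrix values, the zero-based indexing, and the csr_get helper are illustrative assumptions and not part of the Arm Performance Libraries API.

```c
#include <assert.h>

/* Illustrative 3x4 matrix (zero-based indexing assumed):
 *   [ 1 0 2 0 ]
 *   [ 0 0 3 0 ]
 *   [ 4 5 0 6 ]
 */
static const int    example_row_ptr[]  = {0, 2, 3, 6};        /* n_rows + 1 entries */
static const int    example_col_indx[] = {0, 2, 2, 0, 1, 3};  /* column of each stored value */
static const double example_vals[]     = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};

/* Hypothetical helper: look up element (i, j) of a CSR matrix.
 * Row i owns the stored entries row_ptr[i] .. row_ptr[i + 1] - 1;
 * anything not stored is implicitly zero. */
static double csr_get(const int *row_ptr, const int *col_indx,
                      const double *vals, int i, int j)
{
    assert(i >= 0 && j >= 0);
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
        if (col_indx[k] == j) {
            return vals[k];
        }
    }
    return 0.0;
}
```

CSC is the same idea with the roles of rows and columns swapped, while COO stores an explicit (row, column, value) triple for every non-zero.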

The operation that is performed by our SpMM is the same as the dense *GEMM equivalent:

C := α op(A) op(B) + β C

where op indicates an optional transpose operation, and α and β are scalars.
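As a reference for what this operation computes, the following is an unoptimized dense sketch in plain C. It is not the library's implementation: the function names are hypothetical, and it assumes row-major, densely packed storage, whereas BLAS *GEMM conventionally uses column-major storage.

```c
#include <assert.h>
#include <stdbool.h>

/* Read element (i, j) of op(X), where X is row-major with leading dimension ldx. */
static double op_elem(const double *x, int ldx, bool trans, int i, int j)
{
    return trans ? x[j * ldx + i] : x[i * ldx + j];
}

/* Reference C := alpha * op(A) * op(B) + beta * C.
 * op(A) is m x k, op(B) is k x n, C is m x n. */
static void gemm_reference(bool trans_a, bool trans_b, int m, int n, int k,
                           double alpha, const double *a, const double *b,
                           double beta, double *c)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    const int lda = trans_a ? m : k;  /* A is stored as k x m when transposed */
    const int ldb = trans_b ? k : n;  /* B is stored as n x k when transposed */
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += op_elem(a, lda, trans_a, i, p) *
                       op_elem(b, ldb, trans_b, p, j);
            }
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}
```

A sparse implementation produces the same values but visits only the stored non-zeros, which is why the result of a sparse product can be checked against a dense reference like this on small inputs.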

For convenience, we also provide a non-destructive SpADD operation (i.e. where the add operands are not overwritten):

C := α op(A) + β op(B)

We allow the creation of two special matrices: the null matrix (of all zeros) and the multiplicative identity matrix (a unit diagonal and zeros elsewhere) to optimize the trivial forms of these operations.

The API follows the same workflow as our SpMV functionality: create, hint, optimize, execute, destroy. For SpMM we create handles for the three matrices, for example:

armpl_status_t info = armpl_spmat_create_csr_d(&armpl_mat_a, M, K, row_ptr_a, col_indx_a, vals_a, creation_flags); // Similarly for B and C

We can then optionally provide hints about the structure and usage of each matrix:

info = armpl_spmat_hint(armpl_mat_a, ARMPL_SPARSE_HINT_STRUCTURE, ARMPL_SPARSE_STRUCTURE_UNSTRUCTURED); // Similarly for B and C

Next, we optimize the SpMM operation, at which point new optimized data structures may be created:

info = armpl_spmm_optimize(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_a, armpl_mat_b, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_c);

Execution then populates matrix C with the result:

info = armpl_spmm_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, alpha, armpl_mat_a, armpl_mat_b, beta, armpl_mat_c);

Since matrix C is hidden behind the opaque handle armpl_mat_c, we must export the result back into CSR form:

info = armpl_spmat_export_csr_d(armpl_mat_c, 0, &nrows_c, &ncols_c, &out_row_ptr_c, &out_col_indx_c, &out_vals_c);

Finally, we clean up by destroying the matrix handles:

info = armpl_spmat_destroy(armpl_mat_a); // Similarly for B and C

SpMV parallel performance improvements

The performance of Sparse Matrix-Vector multiplication (SpMV) has been significantly improved for some problems when run in parallel. For example, the following selection of matrices from the Florida sparse matrix collection shows performance improvements of up to four times compared with the 19.2 Arm Performance Libraries release on a ThunderX2.
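For readers unfamiliar with the kernel being measured, here is a minimal reference SpMV for a CSR matrix in plain C. It is a textbook sketch, not the Arm Performance Libraries implementation: the function name is hypothetical, and sharing rows between OpenMP threads is only one common way to parallelize the loop.

```c
#include <assert.h>

/* Reference CSR SpMV: y := A * x for a matrix with n_rows rows.
 * Each row writes a distinct y[i], so rows can be processed independently.
 * The pragma is only active when the code is compiled with OpenMP enabled. */
static void csr_spmv(int n_rows, const int *row_ptr, const int *col_indx,
                     const double *vals, const double *x, double *y)
{
    assert(n_rows >= 0);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += vals[k] * x[col_indx[k]];
        }
        y[i] = sum;
    }
}
```

With a static row split, threads that receive rows with many non-zeros do more work than others, which is one reason parallel SpMV performance depends heavily on the matrix structure.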

FFT performance improvements

We have improved the performance of our Fast Fourier Transform routines, particularly for small transform lengths (n<20) and transform lengths with prime factors. Our performance now compares favorably with FFTW, particularly for complex-to-complex transforms. This is illustrated in the following graph for the single-precision case on a ThunderX2, where most of the results fall above the bold line y=1, indicating faster performance with Arm PL 19.3 than with FFTW 3.3.8. (FFTW was compiled with GCC 8.2 and configured with --enable-neon --enable-fma.)

A graph to show Arm performance libraries

We have also enabled shared-memory parallelism for multi-dimensional problems. We attain 80% parallel efficiency for a 3-d problem of dimensions 70x70x70 on a ThunderX2:

A graph to show parallel efficiency
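The post does not spell out how parallel efficiency is defined. Assuming the usual definition, speedup divided by thread count, it can be computed as in this small C sketch. The function name is hypothetical.

```c
#include <assert.h>

/* Parallel efficiency under the usual definition:
 *   E = T_1 / (p * T_p)
 * where T_1 is the single-thread time, T_p the time on p threads.
 * A value of 1.0 means perfect scaling. */
static double parallel_efficiency(double t_serial, double t_parallel,
                                  int n_threads)
{
    assert(t_serial > 0.0 && t_parallel > 0.0 && n_threads > 0);
    return t_serial / ((double)n_threads * t_parallel);
}
```

As a purely hypothetical illustration (not a measurement from this release), a transform that takes 8.0 s on one thread and 2.5 s on four threads has an efficiency of 8.0 / (4 x 2.5) = 0.8, that is, 80%.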

Support

If you have questions or doubts, or want to raise an issue, either email HPC software support or visit the support page. Most requests are answered within a single working day. The HPC ecosystem pages also have valuable information to get you started on Arm.

Conclusion

I am excited to announce the availability of Arm Allinea Studio 19.3, with major enhancements to the compiler and libraries. We plan to provide the next major release, 20.0, towards the end of November 2019, with more features and improvements.
