Arm Allinea Studio 19.3: performance improvements in preparation for SC'19 benchmarks

Patrick Wohlschlegel
September 26, 2019

Arm Allinea Studio 19.3 is now available.

This release adds major new features for our customers to test and benchmark, ready for presentations and workshops at SC in November. Highlights of this release include:

  • SVE-enabled performance libraries, allowing customers to start preparing for SVE-enabled hardware and to test their hardware models.
  • Support for the RHEL 8 and SLES 15 operating systems.
  • An offline HTML copy of the user documentation, now packaged with the product, allowing users to read the documentation without having to get through firewalls to access the Internet.
  • Improvements to auto-vectorization, including Neon reduction loops, Fortran loops with calls to math routines, and C/C++ routines with calls to sincos.
  • Field quality improvements, with numerous bug fixes implemented.

Introduction of SVE library

Arm Performance Libraries now includes libraries featuring Scalable Vector Extension (SVE) instructions. The SVE-enabled version has not been tuned for any particular microarchitecture; it is provided so that users can experiment with SVE in emulation, ahead of silicon deployments.

We recommend using the Arm Instruction Emulator (ArmIE) to execute programs containing SVE instructions.

To link against the SVE libraries, simply specify -armpl=sve with the Arm compiler. For example, to link an executable:

armflang -armpl=sve -lm driver.o -o driver.exe

And to run the executable using armie, emulating a core with 512-bit vector units:

armie -msve-vector-bits=512 ./driver.exe

A full set of examples is provided in the release.
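
As a point of reference, here is a minimal sketch of a driver program that could be linked and run this way. The file name, the use of the CBLAS dgemm interface, and building with armclang and the same -armpl=sve option are illustrative assumptions on our part, not the packaged examples themselves:

/* driver.c: a small program calling a BLAS routine from Arm Performance Libraries.
 * Build (assumed):  armclang -armpl=sve driver.c -lm -o driver.exe
 * Run (as above):   armie -msve-vector-bits=512 ./driver.exe
 */
#include <stdio.h>
#include <armpl.h>   /* Arm Performance Libraries header (provides the CBLAS interface) */

int main(void) {
    const int n = 4;
    double a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* C := 1.0*A*B + 0.0*C using the standard CBLAS interface */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);   /* expect 8.0 for these inputs */
    return 0;
}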

New support for Sparse Matrix-Matrix multiplication (SpMM)

The 19.3 release sees the introduction of a new set of sparse routines for Sparse Matrix-Matrix multiplication (SpMM), which complements our existing routines for Sparse Matrix-Vector multiplication (SpMV). The new functionality is available in both C and Fortran, along with examples and full documentation.

These routines have not yet been optimized for best performance: work for this release has concentrated on designing the interfaces and providing a functionally correct implementation, with optimizations due to be included in future releases.

As for SpMV, we support matrices that are provided in Compressed Sparse Row (CSR), Compressed Sparse Column (CSC) and Coordinate list (COO) formats. In addition, we also allow users to create sparse matrix handles for dense matrices. This allows for the multiplication of a sparse matrix by a dense matrix (or even the multiplication of two dense matrices, in which case we call the appropriate BLAS *GEMM routine, as expected).
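
For readers less familiar with CSR, the following small illustration shows the three arrays that describe a sparse matrix in that format. The matrix itself is made up, the variable names simply mirror those used in the calls below, and armpl_int_t is the integer type declared in the Arm PL header (check the documentation for the LP64/ILP64 interface you build against):

/* A 3x3 sparse matrix stored in CSR form (0-based indexing):
 *       | 1.0   0   2.0 |
 *   A = |  0   3.0   0  |
 *       | 4.0   0   5.0 |
 */
double      vals_a[]     = { 1.0, 2.0, 3.0, 4.0, 5.0 };  /* non-zero values, row by row     */
armpl_int_t col_indx_a[] = { 0, 2, 1, 0, 2 };            /* column index of each value      */
armpl_int_t row_ptr_a[]  = { 0, 2, 3, 5 };               /* where each row starts in vals_a */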

The operation that is performed by our SpMM is the same as the dense *GEMM equivalent:

C := α op(A) op(B) + β C

where op() indicates an optional transpose operation, and α and β are scalars.

For convenience, we also provide a non-destructive SpADD operation (i.e. where the add operands are not overwritten):

C := α op(A) + β op(B)

We allow the creation of two special matrices: the null matrix (of all zeros) and the multiplicative identity matrix (a unit diagonal and zeros elsewhere) to optimize the trivial forms of these operations.

The API follows the same workflow as our SpMV functionality: create, hint, optimize, execute, destroy. For SpMM we create handles for the three matrices, for example:

armpl_status_t info = armpl_spmat_create_csr_d(&armpl_mat_a, M, K, row_ptr_a, col_indx_a, vals_a, creation_flags); // Similarly for B and C

We will then optionally provide hints about the structure and usage of each matrix:

info = armpl_spmat_hint(armpl_mat_a, ARMPL_SPARSE_HINT_STRUCTURE, ARMPL_SPARSE_STRUCTURE_UNSTRUCTURED); // Similarly for B and C

We then optimize the SpMM operation, during which new optimized data structures may be created:

info = armpl_spmm_optimize(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_a, armpl_mat_b, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_c);

Execution then populates matrix C with the result:

info = armpl_spmm_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, alpha, armpl_mat_a, armpl_mat_b, beta, armpl_mat_c);

Since matrix C is hidden behind the opaque handle armpl_mat_c, we must export the result back into CSR form:

info = armpl_spmat_export_csr_d(armpl_mat_c, 0, &nrows_c, &ncols_c, &out_row_ptr_c, &out_col_indx_c, &out_vals_c);

Finally, we clean up by destroying the matrix handles:

info = armpl_spmat_destroy(armpl_mat_a); // Similarly for B and C
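
Putting these calls together, the sketch below assembles one complete SpMM flow for a tiny test matrix. The matrix data, the default (zero) creation flags, and the way the output handle for C is created (as a CSR structure with a single explicit zero entry) are illustrative assumptions on our part, and error checking of info is omitted for brevity:

/* spmm_example.c: end-to-end sketch of the create/hint/optimize/execute/export/destroy flow.
 * Build (assumed): armclang -armpl spmm_example.c -o spmm_example
 */
#include <stdio.h>
#include <armpl.h>

int main(void) {
    /* A and B reuse the 3x3 CSR matrix shown earlier; C starts out effectively empty. */
    const armpl_int_t n = 3;
    armpl_int_t row_ptr[]    = { 0, 2, 3, 5 };
    armpl_int_t col_indx[]   = { 0, 2, 1, 0, 2 };
    double      vals[]       = { 1.0, 2.0, 3.0, 4.0, 5.0 };
    armpl_int_t row_ptr_c[]  = { 0, 1, 1, 1 };   /* one explicit zero so C has valid CSR arrays */
    armpl_int_t col_indx_c[] = { 0 };
    double      vals_c[]     = { 0.0 };

    armpl_spmat_t armpl_mat_a, armpl_mat_b, armpl_mat_c;
    armpl_status_t info;

    /* 1. Create handles for the three matrices (0 = default creation flags, an assumption) */
    info = armpl_spmat_create_csr_d(&armpl_mat_a, n, n, row_ptr, col_indx, vals, 0);
    info = armpl_spmat_create_csr_d(&armpl_mat_b, n, n, row_ptr, col_indx, vals, 0);
    info = armpl_spmat_create_csr_d(&armpl_mat_c, n, n, row_ptr_c, col_indx_c, vals_c, 0);

    /* 2. Hint that A has no particular structure (similarly for B and C) */
    info = armpl_spmat_hint(armpl_mat_a, ARMPL_SPARSE_HINT_STRUCTURE,
                            ARMPL_SPARSE_STRUCTURE_UNSTRUCTURED);

    /* 3. Optimize for C := 1*A*B + 1*C */
    info = armpl_spmm_optimize(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS,
                               ARMPL_SPARSE_SCALAR_ONE, armpl_mat_a, armpl_mat_b,
                               ARMPL_SPARSE_SCALAR_ONE, armpl_mat_c);

    /* 4. Execute with alpha = 1.0 and beta = 1.0 */
    info = armpl_spmm_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS,
                             1.0, armpl_mat_a, armpl_mat_b, 1.0, armpl_mat_c);

    /* 5. Export the result from the opaque handle back into CSR arrays */
    armpl_int_t nrows_c, ncols_c, *out_row_ptr_c, *out_col_indx_c;
    double *out_vals_c;
    info = armpl_spmat_export_csr_d(armpl_mat_c, 0, &nrows_c, &ncols_c,
                                    &out_row_ptr_c, &out_col_indx_c, &out_vals_c);
    printf("C has %d rows\n", (int)nrows_c);

    /* 6. Destroy the handles */
    info = armpl_spmat_destroy(armpl_mat_a);
    info = armpl_spmat_destroy(armpl_mat_b);
    info = armpl_spmat_destroy(armpl_mat_c);
    return 0;
}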

SpMV parallel performance improvements

The performance of Sparse Matrix-Vector multiplication (SpMV) has been significantly improved for some problems when run in parallel. For example, a selection of matrices from the Florida sparse matrix collection shows up to a 4x performance improvement compared with the 19.2 Arm Performance Libraries release on a ThunderX2.

FFT performance improvements

We have improved the performance of our Fast Fourier Transform routines, particularly for small transform lengths (n<20) and transform lengths with prime factors. Our performance now compares favorably with FFTW, particularly for complex-to-complex transforms, as illustrated in the following graph for the single-precision case on a ThunderX2: most of the results fall above the bold line y=1, indicating faster performance with Arm PL 19.3 than with FFTW 3.3.8 (FFTW was compiled with GCC 8.2 and configured with --enable-neon --enable-fma).

[Figure: Arm Performance Libraries 19.3 FFT performance relative to FFTW 3.3.8, single precision, ThunderX2]

We have also enabled shared-memory parallelism for multi-dimensional problems. We attain 80% parallel efficiency for a 3-D problem of dimensions 70x70x70 on a ThunderX2:

[Figure: parallel efficiency of a 70x70x70 3-D FFT on ThunderX2]
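
Arm Performance Libraries also provides FFTW3-compatible interfaces, so a program written against the familiar fftw3.h API can be relinked against Arm PL. This is not covered in the post above; the build line in the sketch below is an assumption, and it uses the double-precision interface rather than the single-precision one plotted here:

/* fft_example.c: a small complex-to-complex transform via the FFTW3-style interface.
 * Build (assumed): armclang -armpl fft_example.c -lm -o fft_example
 */
#include <stdio.h>
#include <fftw3.h>

int main(void) {
    const int n = 16;                        /* a small transform length, as discussed above */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Plan first (FFTW_ESTIMATE does not touch the arrays), then fill the input */
    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    for (int i = 0; i < n; i++) { in[i][0] = (double)i; in[i][1] = 0.0; }

    fftw_execute(plan);
    printf("out[0] = %f + %fi\n", out[0][0], out[0][1]);   /* DC term = sum of the inputs */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}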

Support

If you have questions or want to raise an issue, either email HPC software support or visit the support page. Most requests are answered within a single working day. The HPC ecosystem pages also have valuable information to help you get started on Arm.

Conclusion

I am excited to announce the availability of Arm Allinea Studio 19.3, with major enhancements to the compilers and libraries. We plan to provide the next major release, 20.0, towards the end of November 2019, with more features and improvements.

Request a trial
Buy a license
