Arm Allinea Studio 19.3 is now available.
This release adds major new features for our customers to test and benchmark, ready for presentations and workshops at SC in November. Highlights of this release include:
Arm Performance Libraries now contains libraries featuring Scalable Vector Extension (SVE) instructions. The SVE-enabled version has not been tuned for any particular microarchitecture, and is available to experiment with SVE in an emulated mode, ahead of silicon deployments.
We recommend using Arm instruction emulator (ArmIE) to execute programs containing SVE instructions.
To link to the SVE libraries it is possible to specify simply -armpl=sve with the Arm compiler. For example, to link an executable:
armflang -armpl=sve -lm driver.o -o driver.exe
And to run the executable using armie, emulating a core with 512-bit vector units:
armie -msve-vector-bits=512 ./driver.exe
A full set of examples is provided in the release.
The 19.3 release sees the introduction of a new set of sparse routines for Sparse Matrix-Matrix multiplication which complements our existing routines for Sparse Matrix-Vector multiplication (SpMV). The new functionality is available in both C and Fortran along with examples and full documentation.
These routines are yet to be optimized to provide the best performance. Work for this release has concentrated on designing the interfaces and providing a functionally correct implementation. Optimizations are due to be included in future releases.
As for SpMV, we support matrices that are provided in Compressed Sparse Row (CSR), Compressed Sparse Column (CSC) and Coordinate list (COO) formats. In addition, we also allow users to create sparse matrix handles for dense matrices. This allows for the multiplication of a sparse matrix by a dense matrix (or even the multiplication of two dense matrices, in which case we call the appropriate BLAS *GEMM routine, as expected).
The operation that is performed by our SpMM is the same as the dense *GEMM equivalent:
C := α op(A) op(B) + β C
where op indicates an optional transpose operation, and alpha and beta are scalars.
For convenience, we also provide a non-destructive SpADD operation (i.e. where the add operands are not overwritten):
C := α op(A) + β op(B)
We allow the creation of two special matrices: the null matrix (of all zeros) and the multiplicative identity matrix (a unit diagonal and zeros elsewhere) to optimize the trivial forms of these operations.
The API follows the same workflow as our SpMV functionality: create, hint, optimize, execute, destroy. For SpMM we create handles for the three matrices, for example:
armpl_status_t info = armpl_spmat_create_csr_d(&armpl_mat_a, M, K, row_ptr_a, col_indx_a, vals_a, creation_flags); // Similarly for B and C
We will then optionally provide hints about the structure and usage of each matrix:
info = armpl_spmat_hint(armpl_mat_a, ARMPL_SPARSE_HINT_STRUCTURE, ARMPL_SPARSE_STRUCTURE_UNSTRUCTURED); // Similarly for B and C
Before optimizing the SpMM operation, wherein new optimized data structures may be created:
info = armpl_spmm_optimize(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_a, armpl_mat_b, ARMPL_SPARSE_SCALAR_ONE, armpl_mat_c);
Execution then populates matrix C with the result:
info = armpl_spmm_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, ARMPL_SPARSE_OPERATION_NOTRANS, alpha, armpl_mat_a, armpl_mat_b, beta, armpl_mat_c);
Since matrix C is hidden behind the opaque handle armpl_mat_c, we must export the result back into CSR form:
info = armpl_spmat_export_csr_d(armpl_mat_c, 0, &nrows_c, &ncols_c, &out_row_ptr_c, &out_col_indx_c, &out_vals_c);
Finally, we clean up by destroying the matrix handles:
info = armpl_spmat_destroy(armpl_mat_a); // Similarly for B and C
The performance of Sparse Matrix-Vector multiplication (SpMV) has been significantly improved for some problems when run in parallel. For example, the following selection of matrices from the Florida sparse matrix collection shows up to four times improvements in performance that is compared with the 19.2 Arm Performance Libraries release on a ThunderX2.
We have improved the performance of our Fast Fourier Transform routines, particularly for small transform lengths (n<20) and transform lengths with prime factors. Our performance now compares favorably with FFTW, particularly for complex-to-complex transforms. As illustrated in the following graph for the single-precision case on a ThunderX2, where most of the results fall above the bold line y=1. This indicates faster performance with Arm PL 19.3 than FFTW 3.3.8. (FFTW was compiled with GCC 8.2 and configured with --enable-neon --enable-fma).
We have also enabled shared memory parallelism for multi-dimensional problems. We attain 80% parallel efficiency for a 3-d problem of dimensions that are 70x70x70 on a ThunderX2:
If you have questions, doubts or want to raise an issue either email HPC software support or visit the support page. Most of the requests are answered within a single working day. The HPC ecosystem pages also have valuable information to get you started on Arm.
I am excited to announce the availability of Arm Allinea Studio 19.3 with major enhancements to compiler and libraries. We plan to provide the next major release 20.0 towards the end of November 2019, with more features and improvements.
Request a trialBuy a license