I recently compared the sparse matrix-vector multiplication (SpMV) performance of the Arm Performance Libraries (ArmPL) against a native implementation.
For a sparse matrix in CSR format with dimensions of 16M x 16M, the armpl_spmv_exec_d function performs the same as the native implementation.
Is that expected?
I compiled with the Arm Compiler for Linux (ACfL) using the flags '-Ofast -mcpu=native' to enable SVE and fast math.
Hi.
The Arm Performance Libraries SpMV implementation is optimized specifically for cases where the matrix is reused many times, rather than for a single invocation. This means that a longer time may be spent in the initial "armpl_spmat_optimize()" phase. The key here is to use the 'hints' system to let the library know that this optimization is worthwhile. This is demonstrated in the example on:
https://developer.arm.com/documentation/101004/2202/Sparse-Linear-Algebra/Example-of-SpMV-usage
with the line
info = armpl_spmat_hint(armpl_mat, ARMPL_SPARSE_HINT_SPMV_INVOCATIONS, ARMPL_SPARSE_INVOCATIONS_MANY);
If you are just doing a single invocation of "armpl_spmv_exec_d()" without having first called "armpl_spmat_optimize()", the library assumes that only a single call will be made, and hence a standard CSR method will be used. With a matrix of the size you are talking about you will then be limited by the available memory bandwidth on your system, and hence getting similar performance is to be expected. Having done an optimize operation we tend to see performance increases of between 10% and 2x, but this will be dependent on the sparsity structure of your matrix. There are occasional cases where CSR performance is as good as is achievable, but those cases have very unfriendly sparsity structures.
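For reference, the full call sequence from that documentation page looks roughly like this (a sketch in C; the CSR arrays, matrix size n, and iteration count niter are placeholders you would supply, and error checking on each returned info value is elided):

```c
#include <armpl.h>

/* Sketch of the hint -> optimize -> execute flow for repeated SpMV,
   following the example linked above. row_ptr, col_ind and vals
   describe an n x n CSR matrix; x and y are dense vectors of length n,
   all assumed to be allocated and filled by the caller. */
static void spmv_many(armpl_int_t n, const armpl_int_t *row_ptr,
                      const armpl_int_t *col_ind, const double *vals,
                      const double *x, double *y, int niter)
{
    armpl_spmat_t armpl_mat;
    armpl_status_t info;

    info = armpl_spmat_create_csr_d(&armpl_mat, n, n,
                                    row_ptr, col_ind, vals, 0);

    /* Tell the library the matrix will be used for many SpMV
       invocations, so spending time in the optimize phase pays off. */
    info = armpl_spmat_hint(armpl_mat, ARMPL_SPARSE_HINT_SPMV_INVOCATIONS,
                            ARMPL_SPARSE_INVOCATIONS_MANY);

    info = armpl_spmat_optimize(armpl_mat);

    /* Repeated y = 1.0*A*x + 0.0*y executions amortize the optimize cost. */
    for (int it = 0; it < niter; ++it)
        info = armpl_spmv_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS,
                                 1.0, armpl_mat, x, 0.0, y);

    armpl_spmat_destroy(armpl_mat);
}
```

The point of the hint is that without it (or without the optimize call at all) the library falls back to the plain CSR path, which is exactly the memory-bandwidth-bound case described above.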
Hope this helps.
Chris
Hi,
Have you tried this with MPI? I ran a test where each node handles a range of rows, but saw no performance improvement over the single-process case. Please advise.
Regards,
The problem with parallelising cases such as this using MPI is that you have effectively created 'nproc' separate problems, each with its own sparsity pattern, which need solving separately. Depending on these patterns there may well be an imbalance in the work needed to solve each of them, meaning that the overall solution time is limited by the slowest of them.
The second issue that you run into is that by reducing the size of the matrix on each process you have limited the optimizations that the library is able to perform in its 'armpl_spmat_optimize()' phase. What happens there is that the matrix format is rearranged into something that can perform better. By limiting the number of rows available, however, you are potentially decreasing the applicability or benefit of some of those optimizations.
When you say your parallel performance is similar to using a single thread, I'm more inclined to believe it is sparsity structure related as you should still get a parallelisation speed-up commensurate with the number of processors used if each of those partitioned matrices has a similar cost. Watching out for a "full" (dense) row (sometimes seen in the first or last row of the matrix) would be an example of a case that could give more serial-like performance.
An alternative experiment which could be useful would be to try the OpenMP version of Arm Performance Libraries, letting the threaded library handle the SpMV across many threads within each process. This means changing your partitioning of the matrix to fewer, bigger submatrices, but it should give more opportunities both for optimized implementations and for scalable parallelisation.
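In practice this is a build-and-run change along the following lines (a sketch; the exact -armpl option values vary between ACfL versions, so check your compiler's documentation, and the thread count of 16 is just an example):

```shell
# Link against the OpenMP (threaded) variant of Arm Performance Libraries
# instead of the serial one, using the same optimization flags as before.
armclang -Ofast -mcpu=native -armpl=parallel spmv.c -o spmv

# Control how many threads the library uses at run time.
export OMP_NUM_THREADS=16
./spmv
```

With a hybrid MPI+OpenMP layout you would typically run one rank per node (or per NUMA domain) and let OpenMP fill the cores within it.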
I hope these ideas are useful.