I recently compared the sparse matrix-vector multiplication (SpMV) performance of the Arm Performance Libraries (ArmPL) against a native implementation.
For a sparse matrix in CSR format with dimensions of 16M x 16M, the armpl_spmv_exec_d function performs the same as the native implementation.
Is that expected?
The code is compiled with the Arm Compiler for Linux (ACfL) using the flags '-Ofast -mcpu=native' to enable SVE and fast math.
The problem with parallelising a case like this using MPI is that you have effectively created 'nproc' separate problems, each with its own sparsity pattern, which need solving separately. Depending on those patterns there may well be an imbalance in the work needed to solve each of them, meaning that the overall solution time is limited by the slowest one.
The second issue you run into is that by reducing the size of the matrix on each process you have limited the amount of optimization the library is able to perform in its 'armpl_spmat_optimize()' phase. What happens there is that the matrix format is rearranged into something that can perform better. However, by limiting the number of rows available you are potentially decreasing the applicability or benefit of some of those optimizations.
When you say your parallel performance is similar to using a single thread, I'm more inclined to believe it is related to the sparsity structure, as you should still get a parallelisation speed-up commensurate with the number of processors used if each of the partitioned matrices has a similar cost. Watching out for a "full" (dense) row (sometimes seen in the first or last row of the matrix) would be an example of a case that could give more serial-like performance.
An alternative experiment which could be useful would be to try the OpenMP version of Arm Performance Libraries, letting the threaded version handle the SpMV across many threads. This means changing your partitioning of the matrix to fewer, bigger submatrices, but it should give more opportunities for both optimized implementations and scalable parallelisation.
I hope these ideas are useful.