SpMV multiplies a sparse matrix A by a dense vector B to produce a dense vector C: C = A*B.
I use the CSR sparse matrix format, but the result is even slower than multiplying a dense matrix of the same size by a dense vector.
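For reference, here is a minimal scalar sketch of the CSR computation in plain C, one row at a time. The names `spmv_csr`, `row_ptr`, `col_idx` and `vals` are illustrative assumptions, not taken from any library, and this is host-side code rather than a GPU kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine: y = A * x, with A stored in CSR.
 * row_ptr has n_rows + 1 entries; the nonzeros of row i are
 * vals[row_ptr[i] .. row_ptr[i+1]-1], with matching columns in col_idx. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr,
              const size_t *col_idx,
              const float *vals,
              const float *x,
              float *y)
{
    assert(row_ptr && col_idx && vals && x && y);
    for (size_t i = 0; i < n_rows; ++i) {
        float sum = 0.0f;
        /* Walk the nonzeros of row i; x is read through col_idx,
         * so its accesses are irregular. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += vals[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

The indirect read `x[col_idx[k]]` is the irregular memory access in this loop, and it is a plausible place to look first when CSR runs slower than expected.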
I have read some papers and open-source libraries (CLSPARSE); most of them optimize for AMD and NVIDIA GPUs, not for Mali GPUs. Mali GPUs do not execute threads in warps, so warp-based optimizations may not be useful on Mali.
Some papers use BCSR (block CSR) to make memory access more cache-friendly.
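To make the BCSR idea concrete, here is a rough sketch in plain C with a fixed 2x2 block size. The block size, the row-major layout inside each block, and the names `spmv_bcsr2`, `brow_ptr`, `bcol_idx` and `bvals` are all assumptions for illustration; the papers may use different layouts.

```c
#include <stddef.h>

/* Hypothetical BCSR routine with fixed 2x2 blocks: y = A * x.
 * brow_ptr has n_block_rows + 1 entries, bcol_idx holds the block-column
 * index of each stored block, and bvals holds 4 floats per block in
 * row-major order. Zero entries inside a stored block are kept explicitly. */
void spmv_bcsr2(size_t n_block_rows,
                const size_t *brow_ptr,
                const size_t *bcol_idx,
                const float *bvals,
                const float *x,
                float *y)
{
    for (size_t bi = 0; bi < n_block_rows; ++bi) {
        float s0 = 0.0f; /* accumulator for row 2*bi     */
        float s1 = 0.0f; /* accumulator for row 2*bi + 1 */
        for (size_t k = brow_ptr[bi]; k < brow_ptr[bi + 1]; ++k) {
            const float *b = bvals + 4 * k;         /* contiguous block data */
            const float *xv = x + 2 * bcol_idx[k];  /* contiguous pair of x  */
            s0 += b[0] * xv[0] + b[1] * xv[1];
            s1 += b[2] * xv[0] + b[3] * xv[1];
        }
        y[2 * bi] = s0;
        y[2 * bi + 1] = s1;
    }
}
```

One column index now covers four matrix values and two adjacent entries of x, which is where the cache-friendliness comes from. The cost is the explicit zeros stored inside partly filled blocks, so the format only pays off when the nonzeros are actually clustered.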
Maybe shared memory or vectorization (float4/float8/float16) could help. If anyone has done this kind of optimization, please give some advice.
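To show what the vectorization idea could look like, here is a sketch in plain C that processes the nonzeros of each CSR row in groups of four with four independent partial sums. It is only a stand-in: in an OpenCL kernel the group of four would typically become float4 values, and whether that helps on a given Mali GPU is something to measure, not something this sketch demonstrates. The names `row_dot4` and `spmv_csr_vec4` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: dot product of one CSR row with x, taking the
 * nonzeros four at a time. acc[0..3] play the role of the four lanes
 * of a float4 accumulator. */
static float row_dot4(const float *vals, const size_t *col_idx,
                      size_t begin, size_t end, const float *x)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t k = begin;
    /* Main loop: full groups of four nonzeros. vals is read
     * contiguously; x is still gathered through col_idx. */
    for (; k + 4 <= end; k += 4) {
        for (int lane = 0; lane < 4; ++lane) {
            acc[lane] += vals[k + lane] * x[col_idx[k + lane]];
        }
    }
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    /* Tail loop: the remaining 0 to 3 nonzeros of the row. */
    for (; k < end; ++k) {
        sum += vals[k] * x[col_idx[k]];
    }
    return sum;
}

/* y = A * x with A in CSR, using the 4-wide row helper above. */
void spmv_csr_vec4(size_t n_rows, const size_t *row_ptr,
                   const size_t *col_idx, const float *vals,
                   const float *x, float *y)
{
    for (size_t i = 0; i < n_rows; ++i) {
        y[i] = row_dot4(vals, col_idx, row_ptr[i], row_ptr[i + 1], x);
    }
}
```

Rows with fewer than four nonzeros fall entirely into the tail loop, so this layout only benefits matrices whose rows are reasonably long. Note also that the partial sums are added in a different order than in a scalar loop, so floating-point results can differ slightly in the last bits.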
Thanks for citing our work :)