SpMV multiplies a sparse matrix A by a dense vector B to produce a dense vector C: C = A*B.
I use the CSR sparse matrix format, but the result is even slower than multiplying a dense matrix of the same size by a dense vector.
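For reference, here is a minimal sketch of the scalar CSR product described above, written as plain host-side C rather than as an OpenCL kernel. The function name `spmv_csr` and the array names `row_ptr`, `col_idx` and `val` are assumptions for illustration and are not taken from any library.

```c
#include <stddef.h>

/* Hypothetical CSR layout (names are illustrative):
 *   row_ptr : n_rows + 1 entries, row i owns val[row_ptr[i] .. row_ptr[i+1]-1]
 *   col_idx : column index of each stored nonzero
 *   val     : value of each stored nonzero
 * Computes c = A * b for a dense vector b. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr,
              const size_t *col_idx,
              const float *val,
              const float *b,
              float *c)
{
    for (size_t i = 0; i < n_rows; ++i) {
        float sum = 0.0f;
        /* Walk the nonzeros of row i; b is read through col_idx (a gather). */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += val[k] * b[col_idx[k]];
        }
        c[i] = sum;
    }
}
```

In an OpenCL port the outer loop would typically become one work-item per row. The indirect read `b[col_idx[k]]` is the part that differs from the dense case, where both operands are read contiguously.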
I have read some papers and open source libraries (clSPARSE). Most of them optimize for AMD and NVIDIA GPUs, not for Mali GPUs. Mali GPUs don't use warps to execute threads, so warp-based optimizations may not be useful on Mali.
Some papers use BCSR (block CSR) to make memory accesses more cache friendly.
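To make the BCSR idea concrete, here is a small host-side C sketch that stores the matrix as dense 2x2 blocks. The block size `BS`, the function name `spmv_bcsr` and the array names are all assumptions chosen for illustration. It assumes the matrix dimensions are multiples of `BS` and that each block is stored row-major, with explicit zeros where needed.

```c
#include <stddef.h>

#define BS 2  /* assumed block size; each stored block is BS x BS, row-major */

/* Hypothetical BCSR layout (names are illustrative):
 *   brow_ptr : n_block_rows + 1 entries, indexing into bcol_idx
 *   bcol_idx : block-column index of each stored block
 *   bval     : BS*BS values per stored block, blocks laid out back to back
 * Computes c = A * b. */
void spmv_bcsr(size_t n_block_rows,
               const size_t *brow_ptr,
               const size_t *bcol_idx,
               const float *bval,
               const float *b,
               float *c)
{
    for (size_t bi = 0; bi < n_block_rows; ++bi) {
        float acc[BS] = {0.0f};
        for (size_t k = brow_ptr[bi]; k < brow_ptr[bi + 1]; ++k) {
            const float *blk = bval + k * BS * BS;   /* contiguous block values */
            const float *x = b + bcol_idx[k] * BS;   /* contiguous slice of b */
            for (size_t r = 0; r < BS; ++r) {
                for (size_t cc = 0; cc < BS; ++cc) {
                    acc[r] += blk[r * BS + cc] * x[cc];
                }
            }
        }
        for (size_t r = 0; r < BS; ++r) {
            c[bi * BS + r] = acc[r];
        }
    }
}
```

Compared with plain CSR, one column index is stored per block instead of per nonzero, and both the block values and the matching slice of `b` are read contiguously. The cost is the padding zeros stored inside partially filled blocks, so whether it pays off depends on the sparsity pattern.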
Maybe shared memory or vectorization (float4/float8/float16) could help. If anyone has done this optimization, please give some advice.
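As a rough illustration of what a float4-style inner loop could look like, the sketch below processes four nonzeros of a row per iteration with four independent partial sums, plus a scalar loop for the leftovers. It is plain host-side C, not OpenCL, and the function name `spmv_csr_unroll4` is an assumption for illustration. No claim is made here about how it performs on Mali.

```c
#include <stddef.h>

/* CSR SpMV with the row loop unrolled by four, mimicking the four lanes of
 * a float4. Array layout is the usual CSR one: row_ptr (n_rows + 1 entries),
 * col_idx and val (one entry per nonzero). Names are illustrative. */
void spmv_csr_unroll4(size_t n_rows,
                      const size_t *row_ptr,
                      const size_t *col_idx,
                      const float *val,
                      const float *b,
                      float *c)
{
    for (size_t i = 0; i < n_rows; ++i) {
        size_t k = row_ptr[i];
        const size_t end = row_ptr[i + 1];
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

        /* Four nonzeros per iteration; val is read contiguously,
         * b is still gathered through col_idx. */
        for (; k + 4 <= end; k += 4) {
            s0 += val[k]     * b[col_idx[k]];
            s1 += val[k + 1] * b[col_idx[k + 1]];
            s2 += val[k + 2] * b[col_idx[k + 2]];
            s3 += val[k + 3] * b[col_idx[k + 3]];
        }

        /* Reduce the four lanes, then handle the remaining 0..3 nonzeros. */
        float sum = (s0 + s1) + (s2 + s3);
        for (; k < end; ++k) {
            sum += val[k] * b[col_idx[k]];
        }
        c[i] = sum;
    }
}
```

In an OpenCL kernel the four `val` reads could become a single `vload4`, while the reads of `b` remain a gather. Note that the summation order differs from the scalar loop, so floating-point results can differ in the last bits.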
There are lots of sparse formats available. We even did some research on this back in 2010 with a PhD student from Edinburgh (http://dl.acm.org/citation.cfm?id=1964196). Unfortunately, when he did his internship at ARM, the first Midgard GPU, the Mali-T604, was still in development, so we ended up running experiments on NVIDIA and AMD platforms.
I believe this would still be interesting to study today using a framework for benchmarking and optimisation such as Collective Knowledge (cknowledge.org).