SpMV multiplies a sparse matrix A by a dense vector B to produce a dense vector C: C = A*B.
I use the CSR sparse matrix format, but the result is even slower than multiplying a dense matrix of the same size by a dense vector.
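For reference, here is a minimal scalar sketch of CSR SpMV in plain C, not the actual OpenCL kernel from this post. The function name `spmv_csr` and the array names `row_ptr`, `col_idx` and `val` are assumptions chosen for illustration; they follow the usual CSR layout, where `row_ptr` has one more entry than the number of rows.

```c
/* Reference CSR SpMV: c = A * b (illustrative sketch, hypothetical names).
 * row_ptr has n_rows + 1 entries; the nonzeros of row i are stored in
 * val[row_ptr[i] .. row_ptr[i+1]-1], with their column indices in col_idx. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const float *val, const float *b, float *c)
{
    for (int i = 0; i < n_rows; ++i) {
        float sum = 0.0f;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            /* b is read through col_idx, so this access is an irregular gather */
            sum += val[k] * b[col_idx[k]];
        }
        c[i] = sum;
    }
}
```

For the 3x3 matrix with rows (1 0 2), (0 3 0), (4 0 5) and b = (1, 2, 3), the CSR arrays are `row_ptr = {0, 2, 3, 5}`, `col_idx = {0, 2, 1, 0, 2}`, `val = {1, 2, 3, 4, 5}`, and the result is c = (7, 6, 19). The indirect read of `b` is one likely reason a CSR kernel can lose to a dense one when the matrix is not sparse enough.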
I have read some papers and open-source libraries (clSPARSE). Most of them are optimized for AMD and NVIDIA GPUs, not for Mali GPUs. The Mali GPU does not use warps to execute threads, so warp-based optimizations may not be useful on Mali.
Some papers use BCSR (block CSR) to make memory access more cache-friendly.
Maybe shared memory or vectorization (float4/float8/float16) could help. If anyone has done this kind of optimization, please give some advice.
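To make the vectorization idea concrete, here is a rough sketch in plain C of processing four nonzeros of a row per iteration, which is roughly what a float4 version of the inner loop would do. The name `spmv_csr_unroll4` is hypothetical, and whether this actually helps on Mali is an assumption that would need to be measured.

```c
/* CSR SpMV with the inner loop unrolled by 4 (illustrative sketch).
 * The four partial sums s0..s3 play the role of the lanes of a float4
 * accumulator; the tail loop handles rows whose length is not a multiple of 4. */
void spmv_csr_unroll4(int n_rows, const int *row_ptr, const int *col_idx,
                      const float *val, const float *b, float *c)
{
    for (int i = 0; i < n_rows; ++i) {
        int k = row_ptr[i];
        const int end = row_ptr[i + 1];
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

        /* main loop: four nonzeros per iteration */
        for (; k + 4 <= end; k += 4) {
            s0 += val[k]     * b[col_idx[k]];
            s1 += val[k + 1] * b[col_idx[k + 1]];
            s2 += val[k + 2] * b[col_idx[k + 2]];
            s3 += val[k + 3] * b[col_idx[k + 3]];
        }

        /* reduce the four lanes, then finish the remaining 0..3 nonzeros */
        float sum = (s0 + s1) + (s2 + s3);
        for (; k < end; ++k)
            sum += val[k] * b[col_idx[k]];
        c[i] = sum;
    }
}
```

Note that `val` and `col_idx` can be loaded as contiguous vectors, but the reads of `b` are still scattered, so only part of the loop vectorizes cleanly. Because the partial sums are added in a different order than in a scalar loop, results may differ in the last bits for general floating-point data.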
gux wrote: ...Maybe shared memory or vectorization (float4/float8/float16) could help...
Until one of the gurus answers, I would advise you to read this document in the meantime.
Thanks for citing our work :)