I want to implement convolution on ARM Mali GPUs, optimised for both speed and memory. What is the best way to do this? GEMM-based MCMK convolutions are not suitable because they use a lot of memory. Also, a direct implementation on the GPU is much slower than the corresponding CPU version. Any time spent on memory reshaping should be included in the timing measurements.
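To make the memory concern concrete, here is a rough sketch of how large the lowered matrix gets when a convolution is turned into a GEMM via im2col. It assumes a square k x k filter, stride 1 and no padding; the function names `input_elems` and `im2col_elems` are just illustrative, not from any library.

```c
#include <stddef.h>

/* Elements in the original input feature map: channels x height x width. */
static size_t input_elems(size_t c, size_t h, size_t w) {
    return c * h * w;
}

/* Elements in the im2col matrix for a k x k filter.
 * Assumption: stride 1, no padding, so the output is (h-k+1) x (w-k+1).
 * The lowered matrix has c*k*k rows and out_h*out_w columns. */
static size_t im2col_elems(size_t c, size_t h, size_t w, size_t k) {
    size_t out_h = h - k + 1;
    size_t out_w = w - k + 1;
    return (c * k * k) * (out_h * out_w);
}
```

For example, with 64 channels, a 56 x 56 input and a 3 x 3 filter, `input_elems` gives 200704 elements while `im2col_elems` gives 1679616, so the lowered buffer is roughly 8 times the size of the input. That extra buffer, plus the time to fill it, is the overhead I would like to avoid.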
Hi, I am primarily working with OpenCL 1.2, and subgroups are not supported before OpenCL 2.0. I would also like to know the implementation details of the best way to do convolution in terms of memory and performance. I am primarily concerned with single-kernel convolution. Any help would be appreciated.