I want to implement convolution on Arm Mali GPUs, optimised for both speed and memory use. What is the best way to do this? GEMM-based MCMK (multi-channel, multi-kernel) convolutions are not suitable because they use a lot of memory, and a direct implementation on the GPU is much slower than the corresponding CPU version. Any time spent reshaping memory should be included in the timing measurements.
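To put a rough number on the memory cost of the GEMM-based approach, here is a minimal sketch in C. It assumes the usual im2col lowering, no padding, and a square k x k filter. The function names `conv_out_dim`, `im2col_elems` and `input_elems` are hypothetical helpers for illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Output extent along one axis for a "valid" (no padding) convolution. */
size_t conv_out_dim(size_t in, size_t k, size_t stride)
{
    assert(k >= 1 && stride >= 1 && in >= k);
    return (in - k) / stride + 1;
}

/* Elements in the im2col-lowered matrix (assumed layout):
 * one row per (channel, ky, kx) filter tap, one column per output pixel. */
size_t im2col_elems(size_t channels, size_t h, size_t w,
                    size_t k, size_t stride)
{
    return channels * k * k
         * conv_out_dim(h, k, stride)
         * conv_out_dim(w, k, stride);
}

/* Elements in the original input tensor, for comparison. */
size_t input_elems(size_t channels, size_t h, size_t w)
{
    return channels * h * w;
}
```

For a 3-channel 224 x 224 input with a 3 x 3 filter and stride 1, `im2col_elems(3, 224, 224, 3, 1)` is 1,330,668 elements against 150,528 for the input itself, so the lowered matrix is close to k*k = 9 times larger. That extra buffer, plus the time to fill it, is the overhead I am trying to avoid.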
Hi, I am primarily working with OpenCL 1.2, and subgroups are not supported before OpenCL 2.0. I would also like to know the implementation details of the best way to do convolution in terms of memory and performance. I am mainly concerned with single-kernel convolution. Kindly help.
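For reference, the direct approach I am comparing against looks roughly like the plain C sketch below. It assumes "single kernel" means one k x k filter applied to one input channel, with no padding and stride 1. The name `conv2d_direct` is hypothetical. This is only a baseline, not an optimised Mali implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Direct "valid" 2D convolution (cross-correlation, as used in CNNs) of one
 * h x w input with one k x k filter. Writes (h-k+1) x (w-k+1) outputs and
 * needs no scratch buffer beyond the input, filter and output arrays. */
void conv2d_direct(const float *in, size_t h, size_t w,
                   const float *filt, size_t k, float *out)
{
    assert(k >= 1 && h >= k && w >= k);
    size_t oh = h - k + 1;
    size_t ow = w - k + 1;

    /* In an OpenCL port, each (y, x) pair would map to one work-item. */
    for (size_t y = 0; y < oh; ++y) {
        for (size_t x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (size_t ky = 0; ky < k; ++ky) {
                for (size_t kx = 0; kx < k; ++kx) {
                    acc += in[(y + ky) * w + (x + kx)] * filt[ky * k + kx];
                }
            }
            out[y * ow + x] = acc;
        }
    }
}
```

The memory footprint here is just input + filter + output, which is what I want, but each output re-reads k*k input values. I suspect this redundant global memory traffic is why my GPU version is slow.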