I want to implement convolution on Arm Mali GPUs, optimised for both speed and memory. What is the best way to do this? GEMM-based MCMK convolutions are not suitable because they use a lot of memory, and a direct implementation on the GPU is much slower than the corresponding CPU version. Any time spent reshaping memory should be included in the timing measurements.
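To put a number on the memory concern, here is a minimal sketch that counts the elements in the unrolled (im2col) matrix that a GEMM-based convolution typically builds, compared with the input itself. It assumes stride 1 and no padding. The function names are hypothetical and only for illustration.

```c
#include <stddef.h>

/* Elements in the unrolled (im2col) matrix for a stride-1, unpadded
 * convolution with a k x k filter: one column per output pixel and one
 * row per (channel, ky, kx) filter tap. Assumes height >= k, width >= k. */
size_t im2col_elems(size_t channels, size_t height, size_t width, size_t k)
{
    size_t out_h = height - k + 1;
    size_t out_w = width - k + 1;
    return channels * k * k * out_h * out_w;
}

/* Elements in the original input tensor, for comparison. */
size_t input_elems(size_t channels, size_t height, size_t width)
{
    return channels * height * width;
}
```

For a 3 x 224 x 224 input and a 3 x 3 filter, `im2col_elems(3, 224, 224, 3)` gives 1,330,668 elements against 150,528 for the input, roughly 8.8 times as many, and that is before counting the time spent filling the buffer.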
Hi,
Here are a few leads:
1. Have you considered using the Arm Compute Library [1]? It supports a number of convolution kernels optimised for Mali GPUs. We'd love to hear if your use-case isn't covered or if the library isn't convenient to use for some reason.
2. You could try using subgroup operations to exchange data between work-items in a direct implementation.
Hope this helps.
Regards,
Kévin
[1] github.com/.../ComputeLibrary
Hi, I am primarily working with OpenCL 1.2, and subgroups are not supported until OpenCL 2.0. I would also like to know the implementation details of the best way to do convolution in terms of memory and performance. I am mainly concerned with single-kernel convolution. Any help would be appreciated.
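For reference, one approach that needs neither subgroups nor an unrolled buffer is to have each work-item compute a small tile of adjacent outputs, so that each filter weight is loaded once per tile and neighbouring input values are reused. Below is a minimal sketch of that loop structure in plain C, not an OpenCL kernel and not the Arm Compute Library's implementation. It assumes a single channel, a single k x k filter, stride 1 and no padding. `TILE_W`, `conv_work_item` and `conv2d_direct` are hypothetical names, and the tile width is a tuning value that would need to be measured on the target GPU.

```c
#include <stddef.h>

#define TILE_W 4  /* outputs per "work-item"; hypothetical tuning value */

/* Stand-in for one OpenCL work-item: computes up to TILE_W adjacent
 * outputs of row oy, starting at column ox0. As is usual in CNN code,
 * the filter is not flipped (this is cross-correlation). */
static void conv_work_item(const float *in, size_t in_w,
                           const float *filt, size_t k,
                           float *out, size_t out_w,
                           size_t oy, size_t ox0)
{
    float acc[TILE_W] = {0};
    /* Clamp the tile at the right edge of the output row. */
    size_t n = (out_w - ox0 < TILE_W) ? (out_w - ox0) : TILE_W;

    for (size_t ky = 0; ky < k; ++ky) {
        const float *row = in + (oy + ky) * in_w + ox0;
        for (size_t kx = 0; kx < k; ++kx) {
            float w = filt[ky * k + kx];  /* weight loaded once per tile */
            for (size_t t = 0; t < n; ++t)
                acc[t] += w * row[kx + t];
        }
    }
    for (size_t t = 0; t < n; ++t)
        out[oy * out_w + ox0 + t] = acc[t];
}

/* Host-side loops standing in for the 2D NDRange: one call per work-item.
 * out must hold (in_h - k + 1) * (in_w - k + 1) floats. */
void conv2d_direct(const float *in, size_t in_h, size_t in_w,
                   const float *filt, size_t k, float *out)
{
    size_t out_h = in_h - k + 1;
    size_t out_w = in_w - k + 1;
    for (size_t oy = 0; oy < out_h; ++oy)
        for (size_t ox0 = 0; ox0 < out_w; ox0 += TILE_W)
            conv_work_item(in, in_w, filt, k, out, out_w, oy, ox0);
}
```

In an OpenCL 1.2 kernel, the two outer loops would become the global work size and the inner loop over `t` is a natural candidate for vector types such as `float4`. The only memory used beyond the input, filter and output buffers is the per-work-item accumulator tile.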