
Optimised GPU convolution for low-memory integrated devices such as Arm processors/GPUs?

I wish to implement convolution on Arm Mali GPUs and want it to be optimised for both speed and memory. What's the best way to do this? GEMM-based MCMK convolutions are not suitable because they use a lot of memory. Also, a direct implementation on the GPU is far slower than the corresponding CPU version. Any time spent reshaping memory should be included in the timing measurements.
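To make the memory concern concrete, here is a rough sketch of the arithmetic behind it. It assumes a stride-1, unpadded convolution and the common im2col lowering for GEMM-based convolution; the helper names `input_elems` and `im2col_elems` are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in a C x H x W input tensor. */
static size_t input_elems(size_t c, size_t h, size_t w)
{
    return c * h * w;
}

/* Number of elements in the im2col patch matrix for a stride-1, unpadded
 * convolution with a KH x KW kernel: every output position stores its own
 * copy of the C * KH * KW input values it reads. */
static size_t im2col_elems(size_t c, size_t h, size_t w, size_t kh, size_t kw)
{
    assert(h >= kh && w >= kw);
    size_t out_h = h - kh + 1;
    size_t out_w = w - kw + 1;
    return c * kh * kw * out_h * out_w;
}
```

For a 1 x 5 x 5 input and a 3 x 3 kernel, the input holds 25 values while the patch matrix holds 81. For large images the ratio approaches KH * KW, which is the extra memory a direct implementation avoids.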

  • Hi,

    Here are a few leads:

    1. Have you considered using the Arm Compute Library [1]? It supports a number of convolution kernels optimised for Mali GPUs. We'd love to hear if your use-case isn't covered or if the library isn't convenient to use for some reason.

    2. You could try using subgroup operations to exchange data between work-items in a direct implementation.

    Hope this helps.

    Regards,

    Kévin

    [1] github.com/.../ComputeLibrary

  • Hi, I am primarily working with OpenCL 1.2, and subgroups are not supported until OpenCL 2.0. I would also like to know the implementation details of the best way to do convolution in terms of memory and performance. I am primarily concerned with single-kernel convolution. Kindly help.
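
As a starting point for the single-kernel case, below is a minimal reference sketch of direct convolution in plain C. It is not taken from the Arm Compute Library and is not tuned for Mali. It assumes a CHW float layout, stride 1, no padding, and the cross-correlation form used in neural networks; the names `conv_at` and `conv2d_direct` are hypothetical. The body of `conv_at` is roughly what one work-item would compute in an OpenCL 1.2 kernel, with `oy` and `ox` taken from `get_global_id`, and no intermediate patch buffer is allocated.

```c
#include <assert.h>
#include <stddef.h>

/* One output element: sum over all channels and kernel taps.
 * In an OpenCL port this would be the per-work-item body. */
static float conv_at(const float *in, const float *wt,
                     size_t c, size_t h, size_t w,
                     size_t kh, size_t kw, size_t oy, size_t ox)
{
    float acc = 0.0f;
    for (size_t ch = 0; ch < c; ++ch) {
        for (size_t ky = 0; ky < kh; ++ky) {
            for (size_t kx = 0; kx < kw; ++kx) {
                float x = in[(ch * h + (oy + ky)) * w + (ox + kx)];
                float k = wt[(ch * kh + ky) * kw + kx];
                acc += x * k;
            }
        }
    }
    return acc;
}

/* Direct multi-channel, single-kernel convolution (stride 1, no padding).
 * in:  c * h * w floats, wt: c * kh * kw floats,
 * out: (h - kh + 1) * (w - kw + 1) floats, written in row-major order.
 * The two loops here stand in for the 2D global work size. */
static void conv2d_direct(const float *in, const float *wt, float *out,
                          size_t c, size_t h, size_t w,
                          size_t kh, size_t kw)
{
    assert(h >= kh && w >= kw);
    size_t out_h = h - kh + 1;
    size_t out_w = w - kw + 1;
    for (size_t oy = 0; oy < out_h; ++oy) {
        for (size_t ox = 0; ox < out_w; ++ox) {
            out[oy * out_w + ox] = conv_at(in, wt, c, h, w, kh, kw, oy, ox);
        }
    }
}
```

A reference like this is mainly useful for checking the output of a GPU kernel. Whether a direct port is fast enough depends on how many outputs each work-item computes and on how the loads are vectorised, which would need to be measured on the target device.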