Optimised GPU convolution for low-memory integrated devices, such as Arm processors/GPUs?

I wish to implement convolution on Arm Mali GPUs, optimised for both speed and memory. What's the best way to do this? GEMM-based MCMK convolutions are not suitable because they use a lot of memory, and a direct GPU implementation is much slower than the corresponding CPU version. Any time spent reshaping memory should be included in the timing measurements.

  • Hi,

    Here are a few leads:

    1. Have you considered using the Arm Compute Library [1]? It supports a number of convolution kernels optimised for Mali GPUs. We'd love to hear if your use-case isn't covered or if the library isn't convenient to use for some reason.

    2. You could try using sub-group operations to exchange data between work-items in a direct implementation.

    Hope this helps.

    Regards,

    Kévin

    [1] github.com/.../ComputeLibrary
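To illustrate the sub-group suggestion, here is a hedged OpenCL C sketch of a 1D 3-tap convolution row. It is not a complete, tuned implementation: it assumes a driver exposing the `cl_khr_subgroup_shuffle` extension (or a vendor equivalent), and the kernel name and shapes are invented for the example. The idea is that each work-item loads its input element once and fetches the halo from neighbouring lanes' registers instead of re-reading global or local memory:

```c
// Sketch only: assumes cl_khr_subgroup_shuffle (or a vendor equivalent).
#pragma OPENCL EXTENSION cl_khr_subgroup_shuffle : enable

__kernel void conv1d_row(__global const float *in,
                         __global const float *weights,  // 3 filter taps
                         __global float *out,
                         int width)
{
    int gid  = get_global_id(0);
    int lane = get_sub_group_local_id();

    // One global load per work-item; neighbours come from the sub-group.
    float centre = (gid < width) ? in[gid] : 0.0f;

    // Out-of-range shuffle indices yield undefined values; the edge
    // lanes are patched up with explicit loads below.
    float left  = sub_group_shuffle(centre, (uint)(lane - 1));
    float right = sub_group_shuffle(centre, (uint)(lane + 1));

    // Lanes at the sub-group boundary still need a global load
    // (or zero padding at the ends of the row).
    if (lane == 0)
        left = (gid > 0) ? in[gid - 1] : 0.0f;
    if (lane == (int)get_sub_group_size() - 1)
        right = (gid + 1 < width) ? in[gid + 1] : 0.0f;

    if (gid < width)
        out[gid] = weights[0] * left + weights[1] * centre + weights[2] * right;
}
```

Compared with a naive direct kernel, this trades repeated global-memory reads for register exchanges within a sub-group, which is usually the cheaper path on a bandwidth-limited integrated GPU; the same pattern extends to 2D filters by shuffling along the row while reusing rows across iterations.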

