Please note: We are aware of an issue affecting replies on the Arm Community forums, which may not be loading as expected.
We apologize for any inconvenience and appreciate your patience while we investigate and work to resolve the issue.
Thank you for your understanding.
Hi guys:
I'm an developing an opencl application on MTK P60(Mali G72 mp3). But i have met some problems.
The application has been run successfully on snapdragon 660(GPU Adreno 512), the performance was about 10ms. But when I run it on Mali G72 mp3, it should cost 60ms! When I check the gpu_utilization, it's almost 100 percent.
Firstly, I couldn't find any specification about the flops performance with the Mali G72.(Adreno 512 GPU flops performance: 255 Gflops)
Secondly, according to benchmarks, performance of G72 mp3 should close to the Adreno 512. I can't find out why it should perform so bad on G72 mp3.
Welcome to talk about this. :)
That's right. I enqueue more than 1 hundred kernels to the queue as one pass and cycle it. But 80% of them are very small kernels .(like relu and sum operation in CNN) And several convolution kernels costs 80% of the time.
Peter Harris said:very small kernels which are not able to parallelize and fully load the GPU because they are so small with a low thread count
I am not quiet understand those words mean. When kernels are small and GPU cycles counter is high, will it affect the GPU load? I have tuned their work group size, and each small kernel can dispatch hundreds of threads. How could the GPU core is not fully loaded?