Hi guys:
I'm developing an OpenCL application on the MTK P60 (Mali-G72 MP3), but I have run into some problems.
The application runs successfully on the Snapdragon 660 (Adreno 512 GPU), where it takes about 10 ms. But when I run it on the Mali-G72 MP3, it takes 60 ms! When I check gpu_utilization, it is almost 100 percent.
Firstly, I couldn't find any specification of the FLOPS performance of the Mali-G72. (Adreno 512 GPU FLOPS performance: 255 GFLOPS.)
Secondly, according to benchmarks, the performance of the G72 MP3 should be close to that of the Adreno 512. I can't figure out why it performs so badly on the G72 MP3.
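On the FLOPS question, a theoretical FP32 peak can be estimated from the number of shader cores, the FMA lanes per core, and the clock. The sketch below is only a back-of-the-envelope estimate: `peak_gflops` is a hypothetical helper, and the lane count and clock used for the Mali-G72 MP3 are assumptions that should be checked against Arm and MediaTek documentation, not published specifications. Only the 255 GFLOPS figure comes from this post.

```python
def peak_gflops(cores: int, fma_lanes_per_core: int, clock_ghz: float) -> float:
    """Theoretical FP32 peak: each FMA lane retires 2 FLOPs (multiply + add) per cycle."""
    return cores * fma_lanes_per_core * 2 * clock_ghz


# ASSUMED values for a Mali-G72 MP3 (not taken from a datasheet):
# 3 shader cores, 12 FMA lanes per core, 0.8 GHz GPU clock.
mali_estimate = peak_gflops(cores=3, fma_lanes_per_core=12, clock_ghz=0.8)

# Figure quoted in the post for the Adreno 512.
adreno_quoted = 255.0

ratio = adreno_quoted / mali_estimate
print(f"Mali estimate: {mali_estimate:.1f} GFLOPS, Adreno/Mali ratio: {ratio:.1f}x")
```

If the assumed values are roughly right, the gap in theoretical peak alone would be several times, so a large difference in run time would not be surprising. Peak numbers say nothing about how well a particular kernel maps onto each GPU.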
Any thoughts on this are welcome. :)
That's right. I enqueue more than a hundred kernels to the queue as one pass and run that pass in a loop. But 80% of them are very small kernels (like the ReLU and sum operations in a CNN), and a few convolution kernels account for 80% of the time.
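The numbers in this thread allow a quick bound on how much the small kernels can matter. The sketch below assumes the 80% convolution share was measured on the Mali device and applies to the 60 ms pass; `best_case_ms` is a hypothetical helper name.

```python
def best_case_ms(total_ms: float, conv_share: float) -> float:
    """Pass time if every non-convolution kernel cost nothing at all."""
    return total_ms * conv_share


total_ms = 60.0    # pass time reported on the Mali-G72 MP3
conv_share = 0.80  # share of time spent in the convolution kernels (assumed measured on Mali)
target_ms = 10.0   # pass time reported on the Adreno 512

floor_ms = best_case_ms(total_ms, conv_share)
required_conv_speedup = floor_ms / target_ms

print(f"Floor with free small kernels: {floor_ms:.0f} ms")
print(f"Convolution speedup still needed to reach {target_ms:.0f} ms: {required_conv_speedup:.1f}x")
```

Under that assumption, even removing every small kernel leaves the pass at about 48 ms, so the convolution kernels are where most of the gap to 10 ms would have to be closed.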
Peter Harris said: very small kernels which are not able to parallelize and fully load the GPU because they are so small with a low thread count
I don't quite understand what those words mean. When kernels are small and the GPU cycle counter is high, does that affect the GPU load? I have tuned their work-group sizes, and each small kernel dispatches hundreds of threads. How can the GPU cores not be fully loaded?
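One way to picture the "low thread count" point is a toy occupancy model: a kernel only fills the GPU if it dispatches at least as many threads as the GPU can keep resident at once. The sketch below is a simplification. `slot_occupancy` is a hypothetical helper, and the resident-thread capacity per core is an assumed value, not a figure from Mali documentation.

```python
import math


def slot_occupancy(threads: int, cores: int, threads_per_core: int) -> float:
    """Fraction of resident thread slots a kernel fills, averaged over its waves."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    capacity = cores * threads_per_core      # threads the GPU can hold at once
    waves = math.ceil(threads / capacity)    # full-GPU rounds needed to run them all
    return threads / (waves * capacity)


# ASSUMED capacity: 3 cores, 384 resident threads per core (check the vendor docs).
small_kernel = slot_occupancy(threads=256, cores=3, threads_per_core=384)
large_kernel = slot_occupancy(threads=100_000, cores=3, threads_per_core=384)

print(f"256-thread kernel fills about {small_kernel:.0%} of the thread slots")
print(f"100000-thread kernel fills about {large_kernel:.0%} of the thread slots")
```

In this model a kernel with a few hundred threads leaves most thread slots empty. A utilization counter that reports the time the GPU is busy could still read close to 100 percent in that situation, since the GPU is active even though it is not full.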