Hi guys:
I'm an developing an opencl application on MTK P60(Mali G72 mp3). But i have met some problems.
The application has been run successfully on snapdragon 660(GPU Adreno 512), the performance was about 10ms. But when I run it on Mali G72 mp3, it should cost 60ms! When I check the gpu_utilization, it's almost 100 percent.
Firstly, I couldn't find any specification about the flops performance with the Mali G72.(Adreno 512 GPU flops performance: 255 Gflops)
Secondly, according to benchmarks, performance of G72 mp3 should close to the Adreno 512. I can't find out why it should perform so bad on G72 mp3.
Welcome to talk about this. :)
Small work groups are not the problem - the problem is the small overall task size. Each kernel is only able to spawn 32 threads, and Mali GPUs have a finite number of concurrent compute dispatches which can be running simultaneously (exactly how many varies, but 8 is a good rule of thumb). 32 * 8 = 256 threads, which is only about 20% of the total thread capacity of the GPU you are using.
The only advice I can give for Mali is to design your algorithm to have fewer small kernels, or interleave them with much larger ones. You're aiming to have > 1000 threads running.