Mali G72 mp3 flops performance

Hi guys:

  I'm developing an OpenCL application on the MTK P60 (Mali G72 MP3), but I have run into some problems.

 The application runs successfully on the Snapdragon 660 (Adreno 512 GPU), where it takes about 10 ms. But when I run it on the Mali G72 MP3, it takes 60 ms! When I check the GPU utilization, it's almost 100 percent.

  Firstly, I couldn't find any specification of the FLOPS performance of the Mali G72. (Adreno 512 GPU FLOPS performance: 255 GFLOPS)

  Secondly, according to benchmarks, the performance of the G72 MP3 should be close to that of the Adreno 512. I can't work out why it performs so badly on the G72 MP3.

  Any discussion of this is welcome. :)

 

Parents
  • I have tuned my work-group sizes, but that does not help. I think I need to try a different approach.


    Mali G72 MP3          Adreno 512 (Snapdragon 660)   Global
    work-group size       work-group size               work size
    2, 1, 1               4, 1, 4                       8, 1, 4
    2, 1, 1               4, 1, 4                       8, 1, 4
    1, 1, 2               4, 1, 4                       8, 1, 4
    1, 1, 2               4, 1, 4                       8, 1, 4
    2, 1, 1               4, 1, 4                       8, 1, 4
    2, 1, 1               4, 1, 4                       8, 1, 4
    2, 1, 1               4, 1, 4                       8, 1, 4
    4, 1, 1               4, 4, 4                       8, 4, 4
    1, 1, 1               8, 1, 2                       8, 1, 4
    4, 1, 1               4, 4, 4                       8, 4, 4
    2, 1, 1               4, 1, 4                       8, 1, 4

    Above are the work-group sizes of the poorly performing kernels on the Adreno 512 (Snapdragon 660) and the Mali G72. Even on the Snapdragon 660, the work-group sizes are not large.
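
    To make the task sizes explicit, the number of work items each kernel launches is the product of its three global-size dimensions. A minimal Python sketch, using the global sizes from the table above (the variable names are illustrative only):

```python
from math import prod

# Global NDRange sizes (x, y, z) copied from the table above, one per kernel.
global_sizes = [
    (8, 1, 4), (8, 1, 4), (8, 1, 4), (8, 1, 4),
    (8, 1, 4), (8, 1, 4), (8, 1, 4), (8, 4, 4),
    (8, 1, 4), (8, 4, 4), (8, 1, 4),
]

# Work items launched by each kernel = product of the three dimensions.
work_items = [prod(size) for size in global_sizes]

print(work_items)
print("largest kernel:", max(work_items), "work items")
```

    Most of the kernels launch only 32 work items, and the largest launches 128.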

    If you only have "hundreds" then you spend most of your time ramping up new workloads and then ramping down again ...

    What exactly happens in the GPU when a workload ramps up? Is it flushing the cache, or enqueuing the kernel from the CPU to the GPU?

    Is there anything I can do to make full use of the GPU with such small kernels?

Children
  • Small work groups are not the problem - the problem is the small overall task size. Each kernel is only able to spawn 32 threads, and Mali GPUs have a finite number of concurrent compute dispatches which can be running simultaneously (exactly how many varies, but 8 is a good rule of thumb).  32 * 8 = 256 threads, which is only about 20% of the total thread capacity of the GPU you are using.
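
    The arithmetic above can be written out as a small Python sketch. The dispatch limit of 8 is the rule of thumb quoted in this reply, not a hardware specification, and the capacity figure is only what "about 20%" implies, so treat it as a rough estimate:

```python
# Figures quoted in the reply above.
THREADS_PER_KERNEL = 32       # e.g. a global size of 8 x 1 x 4
CONCURRENT_DISPATCHES = 8     # rule of thumb from the reply, not a hardware spec
QUOTED_UTILISATION_PCT = 20   # "about 20%" from the reply

# Threads resident on the GPU when every dispatch slot holds one small kernel.
resident_threads = THREADS_PER_KERNEL * CONCURRENT_DISPATCHES

# Thread capacity implied by the quoted percentage (approximate by construction).
implied_capacity = resident_threads * 100 // QUOTED_UTILISATION_PCT

print("resident threads:", resident_threads)
print("implied capacity:", implied_capacity)
```

    Even with every dispatch slot occupied, roughly four fifths of the thread capacity stays idle under these assumptions.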

    The only advice I can give for Mali is to design your algorithm to have fewer small kernels, or interleave them with much larger ones. You're aiming to have > 1000 threads running.
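
    As a rough illustration of "fewer, larger kernels", the sketch below estimates how many 32-thread launches would have to be folded into one dispatch to exceed 1000 threads. It assumes the small kernels run the same code over independent data, which this thread does not confirm; `min_batch` and `merged_global_size` are hypothetical helpers, not part of any OpenCL API:

```python
from math import prod

TARGET_THREADS = 1000          # "> 1000 threads" from the advice above
small_global_size = (8, 1, 4)  # typical global size from the question

def min_batch(global_size, target):
    """Smallest number of launches whose combined work items exceed target."""
    return target // prod(global_size) + 1

def merged_global_size(global_size, batch):
    """Hypothetical merged NDRange: stack the batch along the last dimension."""
    x, y, z = global_size
    return (x, y, z * batch)

batch = min_batch(small_global_size, TARGET_THREADS)
merged = merged_global_size(small_global_size, batch)

print("launches to merge:", batch)
print("merged global size:", merged, "=", prod(merged), "work items")
```

    In a merged kernel, the batch index would be recovered from the stacked dimension (for example from `get_global_id(2) / 4`) to select the data for each original launch.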