Strange OpenCL performance numbers on Mali T760 MP8

I have a fairly complex OpenCL implementation with 2D NDRange as follows:

Num of Work Groups - {10,7}

Work Group Size - {64,1}

With this configuration the kernel takes 0.625 secs. But when I decrease the number of work groups to {10,4}, the time goes up to 0.710 secs.

Here are the timings for the different work-group counts:

{10,7} - 0.625 secs

{10,4} - 0.710 secs

{10,3} - 0.759 secs

{10,2} - 0.826 secs

{10,1} - 0.185 secs  (lower, as expected)


This seems strange to me, as the time taken should have gone down with fewer work groups.
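To make the anomaly easier to see, here is a rough sketch that divides each reported time by the number of work groups. It assumes every work group does about the same amount of work, which the post does not confirm; the names `timings` and `ms_per_work_group` are only illustrative.

```python
# Timings reported above: (work groups in x, work groups in y) -> kernel time in seconds.
timings = {
    (10, 7): 0.625,
    (10, 4): 0.710,
    (10, 3): 0.759,
    (10, 2): 0.826,
    (10, 1): 0.185,
}


def ms_per_work_group(groups, seconds):
    """Average kernel time attributed to one work group, in milliseconds.

    Assumption: work is spread evenly over the work groups.
    """
    gx, gy = groups
    return seconds * 1000.0 / (gx * gy)


per_group = {g: ms_per_work_group(g, t) for g, t in timings.items()}

for groups, ms in per_group.items():
    print(f"{groups}: {ms:.2f} ms per work group")
```

If the cost per work group were constant, these values would all be roughly equal. Instead the per-group cost rises steadily from {10,7} down to {10,2}, and only {10,1} drops again.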

I am timing only the kernel execution, using OpenCL events. The entire computation for {10,7} work groups amounts to only around 1G floating point operations.


With a peak throughput of around 200 GFLOPS for the T760 MP8, my target is to achieve 3 FPS for the entire operation. That is, the entire algorithm should execute in 0.33 secs.

I am looking for deeper optimization options. I have tried vectorization and loop unrolling, but the time is still 0.625 secs.
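For reference, a small back-of-the-envelope sketch of what these figures imply. It takes the "around 1G" operation count and the 200 GFLOPS peak at face value (both are the numbers quoted in this post, not measurements), and the variable names are only illustrative.

```python
TOTAL_FLOP = 1e9           # "around 1G floating point operations" for {10,7}
MEASURED_SECONDS = 0.625   # kernel time reported for {10,7}
TARGET_SECONDS = 0.33      # 3 FPS target
PEAK_GFLOPS = 200.0        # quoted peak for the T760 MP8 (assumption, not measured)


def gflops(flop, seconds):
    """Throughput in GFLOPS for a given operation count and run time."""
    return flop / seconds / 1e9


achieved = gflops(TOTAL_FLOP, MEASURED_SECONDS)      # throughput at 0.625 secs
required = gflops(TOTAL_FLOP, TARGET_SECONDS)        # throughput needed for 0.33 secs
utilisation = achieved / PEAK_GFLOPS                 # fraction of the quoted peak
speedup_needed = MEASURED_SECONDS / TARGET_SECONDS   # factor still missing

print(f"achieved:  {achieved:.2f} GFLOPS ({utilisation:.1%} of quoted peak)")
print(f"required:  {required:.2f} GFLOPS")
print(f"speed-up needed: {speedup_needed:.2f}x")
```

Under these assumptions the kernel runs at well under 1% of the quoted peak, and the target needs a bit less than a 2x speed-up, which suggests the arithmetic units themselves are unlikely to be the limit.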


Can anyone tell me the reason behind this? Or am I missing something?

Also, can any Linux tool help me find the bottleneck in the code, for example the number of scalar and vector registers used?

Platform details:

Mali T760 MP8 on Exynos 7420 platform

Mobile - Galaxy S6

Thanks in advance,

Banger.

  • Hi ravibanger,

    The time increases because you don't dispatch enough threads:

    In theory: you can execute up to 256 threads at the same time on a core, and you've got 8 cores, so if you dispatch around 2000 threads (8 x 256 = 2048) they will all execute in parallel.

    In practice: if the number of threads is close to that limit, the driver will decide it's not worth turning on all the cores and will instead serialise the jobs on a smaller number of cores.

    So, I'm afraid the only solution is to dispatch more threads.

    Regarding static analysis: all Mali registers are 128 bits wide (there is no split between scalar and vector registers). We don't currently provide any static analysis tools.

    However, you can use DS-5 Streamline to see how busy the GPU is and how full the 3 pipes are (L/S, ALU, Texture).

    Hope this helps,
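As a rough illustration of the reply above, this sketch compares the number of threads each configuration from the question dispatches against the theoretical 8 x 256 limit. The 256-threads-per-core figure comes from the reply; the names `capacity` and `occupancy` are only illustrative, and the point at which the driver actually decides to use fewer cores is not documented here.

```python
THREADS_PER_CORE = 256   # figure quoted in the reply
CORES = 8                # Mali T760 MP8
WORK_GROUP_SIZE = 64 * 1 # work group size {64,1} from the question

capacity = THREADS_PER_CORE * CORES  # threads that can be resident at once, in theory

configs = [(10, 7), (10, 4), (10, 3), (10, 2), (10, 1)]


def total_threads(groups):
    """Total work-items dispatched for a given number of work groups."""
    gx, gy = groups
    return gx * gy * WORK_GROUP_SIZE


# Ratio of dispatched threads to the theoretical capacity of the whole GPU.
occupancy = {g: total_threads(g) / capacity for g in configs}

for g in configs:
    print(f"{g}: {total_threads(g):5d} threads, {occupancy[g]:.2f}x capacity")
```

Only {10,7} dispatches comfortably more than the theoretical capacity; {10,4} and {10,3} sit close to it, which is the regime the reply describes.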
