
OpenCL strange performance numbers - Mali T760 MP8

I have a fairly complex OpenCL implementation with 2D NDRange as follows:

Number of work groups: {10,7}

Work-group size: {64,1}

With this configuration the kernel takes 0.625 secs. But when I decrease the number of work groups to {10,4}, the time increases to 0.710 secs.

The timings for the different work-group counts are:

{10,7} - 0.625 secs

{10,4} - 0.710 secs

{10,3} - 0.759 secs

{10,2} - 0.826 secs

{10,1} - 0.185 secs (lower, as expected)


This seems strange to me, as the time taken should have gone down with fewer work groups.
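
To make the anomaly concrete, the timings above can be tabulated per work group. This is only a sketch in Python for the arithmetic; the helper name `summarize` is made up for illustration, and it assumes each work group does a comparable amount of work, which may not hold for this kernel.

```python
# Timings reported above: (groups_x, groups_y) -> kernel time in seconds.
timings = {
    (10, 7): 0.625,
    (10, 4): 0.710,
    (10, 3): 0.759,
    (10, 2): 0.826,
    (10, 1): 0.185,
}

LOCAL_SIZE = (64, 1)  # work-group size from the post


def summarize(groups, seconds, local_size=LOCAL_SIZE):
    """Return (work groups, work items, milliseconds per work group).

    Hypothetical helper: assumes every work group costs about the same.
    """
    num_groups = groups[0] * groups[1]
    num_items = num_groups * local_size[0] * local_size[1]
    ms_per_group = 1000.0 * seconds / num_groups
    return num_groups, num_items, ms_per_group


for groups, seconds in timings.items():
    n_groups, n_items, ms = summarize(groups, seconds)
    print(f"{groups}: {n_groups:3d} groups, {n_items:5d} work items, "
          f"{seconds:.3f} s total, {ms:6.2f} ms per group")
```

If the time per group came out roughly constant, the kernel would be scaling with the amount of work. It does not here, which is what makes the {10,2} to {10,4} results look suspicious.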

I am timing only the kernel execution, using OpenCL events. The entire computation for {10,7} work groups is only around 1G floating-point operations.
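
For reference, event-based timing usually means reading the `CL_PROFILING_COMMAND_QUEUED`, `CL_PROFILING_COMMAND_SUBMIT`, `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` counters with `clGetEventProfilingInfo` on a queue created with `CL_QUEUE_PROFILING_ENABLE`. The counters are in nanoseconds. The sketch below is plain Python that only does the arithmetic on those four values; the function name and the sample timestamps are made up, not measured.

```python
NS_PER_S = 1e9  # OpenCL profiling counters are reported in nanoseconds


def profiling_breakdown(queued_ns, submit_ns, start_ns, end_ns):
    """Split one event's profiling counters into phases, in seconds.

    Hypothetical helper: the four arguments are assumed to be the values
    read back for QUEUED, SUBMIT, START and END of a single kernel event.
    """
    return {
        "queued_to_submit_s": (submit_ns - queued_ns) / NS_PER_S,
        "submit_to_start_s": (start_ns - submit_ns) / NS_PER_S,
        "execution_s": (end_ns - start_ns) / NS_PER_S,
    }


# Made-up timestamps, chosen only so that execution matches 0.625 s.
example = profiling_breakdown(0, 2_000_000, 5_000_000, 630_000_000)
for phase, seconds in example.items():
    print(f"{phase}: {seconds:.3f}")
```

Only the START to END interval is kernel execution; if a measurement accidentally spans QUEUED to END it also includes driver and queueing latency.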


With a peak throughput of around 200 GFLOPS for the T760 MP8, my target is to achieve 3 FPS for the entire operation; that is, the entire algorithm should execute in 0.33 secs.
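
Putting those numbers side by side, here is a quick back-of-the-envelope calculation in Python. The constants are the approximate figures quoted in this post (about 1 GFLOP of work, about 200 GFLOPS peak), so the result is an estimate, not a measurement; the function name is made up.

```python
PEAK_GFLOPS = 200.0   # approximate peak quoted above for the T760 MP8
WORK_GFLOP = 1.0      # approximate work for the {10,7} configuration
MEASURED_S = 0.625    # measured kernel time for {10,7}
TARGET_FPS = 3.0      # desired frame rate


def achieved_gflops(work_gflop, seconds):
    """Throughput in GFLOP/s for a given amount of work and elapsed time."""
    return work_gflop / seconds


achieved = achieved_gflops(WORK_GFLOP, MEASURED_S)
target_s = 1.0 / TARGET_FPS
required = achieved_gflops(WORK_GFLOP, target_s)
utilization = achieved / PEAK_GFLOPS
speedup_needed = MEASURED_S / target_s

print(f"achieved:  {achieved:.2f} GFLOP/s ({utilization:.1%} of peak)")
print(f"required:  {required:.2f} GFLOP/s for {TARGET_FPS:.0f} FPS")
print(f"speed-up needed: {speedup_needed:.2f}x")
```

At 0.625 secs this is roughly 1.6 GFLOP/s, under 1% of the quoted peak, so the kernel is probably limited by something other than arithmetic throughput.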

I am looking for deeper optimization options. I have tried vectorization and loop unrolling, but the time is still 0.625 secs.


Can anyone tell me the reason behind this? Or am I missing something?

Also, is there a Linux tool that can help me find the bottleneck in the code, e.g. the number of scalar and vector registers used?

Platform details:

Mali T760 MP8 on the Exynos 7420 platform

Mobile - Galaxy S6

Thanks in advance,

Banger.