
Strange OpenCL Performance Numbers - Mali T760 MP8

I have a fairly complex OpenCL implementation with 2D NDRange as follows:

Num of Work Groups - {10,7}

Work Group Size {64,1},

With this I get an execution time of 0.625 secs, but when I decrease the number of work groups to {10,4} the time increases to 0.710 secs.

Below are the numbers for different work group counts:

{10,7} - 0.625 secs

{10,4} - 0.710 secs

{10,3} - 0.759 secs

{10,2} - 0.826 secs

{10,1} - 0.185 secs (lower, as expected)


This seems strange to me, as the time taken should have been lower.

I am timing only the kernel execution, using OpenCL events. The entire computation for {10,7} work groups takes only around 1G floating-point operations.
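
For reference, here is a minimal sketch of how I do the event-based timing (simplified; `ctx`, `dev` and `kernel` are placeholders for my real setup):

    #include <CL/cl.h>

    /* Sketch: time one kernel launch with OpenCL profiling events.
       'ctx', 'dev' and 'kernel' stand in for the real objects. */
    static double time_kernel(cl_context ctx, cl_device_id dev, cl_kernel kernel)
    {
        cl_command_queue queue =
            clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, NULL);

        size_t global_ws[2] = { 10 * 64, 7 };   /* {10,7} groups of {64,1} work-items */
        size_t local_ws[2]  = { 64, 1 };

        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_ws, local_ws,
                               0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong t0 = 0, t1 = 0;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);

        clReleaseEvent(evt);
        clReleaseCommandQueue(queue);
        return (t1 - t0) * 1e-9;                /* timestamps are in nanoseconds */
    }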


With a peak throughput of around 200 GFLOPS for the T760 MP8, my target is to achieve 3 FPS for the entire operation, i.e. the whole algorithm should execute in 0.33 secs.

I am looking for deeper optimization options. I have tried vectorization and loop unrolling, but the performance is still only 0.625 secs.
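
To give an idea of what I mean by vectorization, here is an illustrative kernel (not my actual one) using float4 loads/stores instead of scalar operations:

    /* Illustrative only: a scalar multiply-add rewritten with float4 vectors.
       'in', 'out', 'coef' and 'n4' are hypothetical; the real kernel is more complex. */
    __kernel void scale_add_vec4(__global const float *in,
                                 __global float *out,
                                 const float coef,
                                 const int n4)          /* n4 = number of float4 elements */
    {
        int i = get_global_id(0);
        if (i < n4) {
            float4 v = vload4(i, in);       /* one 128-bit load instead of four scalar loads */
            vstore4(v * coef + (float4)(1.0f), i, out);
        }
    }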


Can anyone tell me the reason behind this? Or am I missing something?

Also, is there a Linux tool that can help me find the bottleneck in the code (number of scalar and vector registers used, etc.)?

Platform details:

MALI T760 MP8 on Exynos 7420 platform

Mobile - Galaxy S6

Thanks in Advance

Banger.

  • Hi ravibanger,

    The optimisation guide is a bit out of date, sorry about that. For 2D or 3D NDRanges, providing a local workgroup size to clEnqueueNDRangeKernel will quite often have a significant impact on performance, because it changes the order in which threads are dispatched and therefore the way data gets cached.

    For example, if you have a simple 2D bilateral filter that you split into a horizontal and a vertical pass, you should see a significant change in performance for the vertical pass depending on whether you use {128,1}, {1,128}, {4,32}, etc.

    Unfortunately it's not always simple to predict, so you will have to experiment to find the best size.
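
    A quick way to experiment is to sweep a few candidate sizes and time each launch with profiling events, something like the sketch below (`queue` and `kernel` are assumed to already exist, the queue to have been created with CL_QUEUE_PROFILING_ENABLE, and each candidate to divide the global size):

    /* Sketch: try several local workgroup sizes and report the kernel time for each. */
    size_t global_ws[2] = { 1024, 1024 };    /* e.g. an image-sized NDRange */
    size_t candidates[][2] = { {128, 1}, {1, 128}, {4, 32}, {64, 1} };

    for (int c = 0; c < 4; ++c) {
        size_t local_ws[2] = { candidates[c][0], candidates[c][1] };
        cl_event evt;
        if (clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_ws, local_ws,
                                   0, NULL, &evt) != CL_SUCCESS)
            continue;                        /* this size is not usable for this kernel */
        clWaitForEvents(1, &evt);
        cl_ulong t0, t1;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
        printf("{%zu,%zu}: %f ms\n", local_ws[0], local_ws[1], (t1 - t0) * 1e-6);
        clReleaseEvent(evt);
    }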

    Regarding the CL_OUT_OF_RESOURCES error: if you don't specify a local workgroup size at compilation time, the compiler will not try to limit register usage, and you might end up only being able to use an LWS of 64.
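
    You can check what the compiler actually allows for a particular kernel build with clGetKernelWorkGroupInfo, for example (sketch; `kernel` and `device` are assumed to already exist):

    /* Sketch: query the maximum workgroup size the compiled kernel supports and
       how much private memory it uses (a rough indicator of spilling). */
    size_t max_wg = 0;
    cl_ulong private_mem = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);
    printf("max LWS for this kernel: %zu, private mem: %llu bytes\n",
           max_wg, (unsigned long long)private_mem);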

    However if you specify the LWS in your kernel like this:

    __kernel __attribute__((reqd_work_group_size(128,1,1))) void my_kernel(__global uchar *foo)
    

    Then the compiler will do what it takes for you to be able to use an LWS of 128.

    Most of the time this means it will spill to global memory in order to relieve the register pressure, which obviously has a cost, but the benefit of having 128 threads running instead of 64 should outweigh it.

    Hope this helps,
