I have a fairly complex OpenCL implementation with 2D NDRange as follows:
Number of Work Groups - {10,7}
Work Group Size - {64,1}
With this I get a performance of 0.625 secs, but when I decrease the number of work groups to {10,4} the performance degrades to 0.710 secs.
Below are the numbers for the different configurations:
{10,7} - 0.625 secs
{10,4} - 0.710 secs
{10,3} - 0.759 secs
{10,2} - 0.826 secs
{10,1} - 0.185 secs (this is less as expected)
This seems strange to me, as the time taken should have been less.
I am timing only the kernel execution, using OpenCL events. The entire computation for {10,7} work groups takes only around 1G floating point operations.
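For reference, this is roughly how I measure it (a minimal sketch, assuming the usual CL and stdio headers, an already-created context, and a queue created with CL_QUEUE_PROFILING_ENABLE; the queue, kernel and sizes are placeholders):

cl_event evt;
size_t global[2] = { 640, 7 };   /* {10,7} work groups of {64,1} */
size_t local[2]  = { 64, 1 };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
clWaitForEvents(1, &evt);
cl_ulong t0 = 0, t1 = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
printf("kernel time: %f secs\n", (t1 - t0) * 1e-9);   /* timestamps are in nanoseconds */
clReleaseEvent(evt);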
With a peak throughput of around 200 GFLOPS for the T760 MP8, my target is to achieve 3 FPS for the entire operation, i.e. the entire algorithm should execute in 0.33 secs.
I am looking for deeper optimization options; I have tried vectorization and loop unrolling, but the performance is still only 0.625 secs.
Can anyone tell me the reason behind this? Or am I missing something?
Also, is there any Linux tool that can help me find the bottleneck in the code (the number of scalar and vector registers used, etc.)?
Platform details:
MALI T760 MP8 on Exynos 7420 platform
Mobile - Galaxy S6
Thanks in Advance
Banger.
Hi ravibanger,
The optimisation guide is a bit out of date, sorry about that. Quite often for 2D or 3D workgroup sizes, providing a local workgroup size to clEnqueueNDRangeKernel will have a significant impact on performance, because it changes the order in which threads are dispatched, which in turn changes the way data gets cached.
For example, if you have a simple 2D bilateral filter that you split into a horizontal and a vertical pass, you should see a significant change in performance for the vertical pass depending on whether you use {128,1}, {1,128}, {4,32}, etc.
Unfortunately it's not always simple to predict, so you will have to experiment to find the best size.
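A quick way to experiment is to sweep a few candidate shapes and time each one with the same event mechanism as above (a sketch; queue, kernel, width and height are placeholders, and each global size must be divisible by the local size being tried):

size_t shapes[][2] = { {128, 1}, {1, 128}, {4, 32}, {64, 1} };
size_t global[2] = { width, height };   /* placeholders for your total work size */
for (int i = 0; i < 4; ++i) {
    size_t local[2] = { shapes[i][0], shapes[i][1] };
    cl_event evt;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
    if (err != CL_SUCCESS)
        continue;   /* shape not usable for this kernel / global size */
    clWaitForEvents(1, &evt);
    /* read CL_PROFILING_COMMAND_START/END as above and keep the fastest shape */
    clReleaseEvent(evt);
}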
Regarding CL_OUT_OF_RESOURCES: if you don't specify a local workgroup size at compilation time, the compiler will not try to limit register pressure, and you might end up only being able to use a LWS of 64.
However if you specify the LWS in your kernel like this:
__kernel __attribute__((reqd_work_group_size(128,1,1))) void my_kernel(__global uchar *foo)
Then the compiler will do what it takes for you to be able to use a LWS of 128.
Most of the time this means it will spill to global memory in order to relieve register pressure, which obviously has a cost, but the benefit of having 128 threads running instead of 64 should outweigh it.
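On the host side, the local size passed to clEnqueueNDRangeKernel then has to match the attribute, otherwise the enqueue fails with CL_INVALID_WORK_GROUP_SIZE. A sketch with placeholder sizes:

/* kernel built with reqd_work_group_size(128,1,1) as above */
size_t global[2] = { width, height };   /* placeholder: width assumed to be a multiple of 128 */
size_t local[2]  = { 128, 1 };
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
/* passing a different local size would fail with CL_INVALID_WORK_GROUP_SIZE */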
Hope this helps,
Hi Anthony,
Setting the kernel attribute did work for me, and I was able to launch 128 and 256 threads per work group. Unfortunately the performance degrades. This was probably to be expected, because my global memory data footprint per shader core exceeds the L1 cache size of 16 KB.
So I will have to rework the algorithm to get further performance improvements.
Once again thanks for all the help.
Regards,
To my surprise, I got a performance improvement when I split the image vertically; earlier I was splitting it horizontally.
(Consecutive threads were reading image data at strides of 2560*4 bytes; now it is 1280*4.)
I.e. splitting the image in half makes the algorithm scale down linearly.
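Not the real kernel, but a toy sketch of the access pattern I mean (names are placeholders): consecutive work-items take consecutive rows, so neighbouring threads read addresses one row pitch apart, and halving the image width halves that stride.

__kernel void toy_vertical_pass(__global const float *src, __global float *dst, int row_pitch)
{
    int y = get_global_id(0);   /* consecutive work-items take consecutive rows */
    int x = get_global_id(1);
    /* work-items y and y+1 read addresses row_pitch * 4 bytes apart:
       2560*4 on the full image, 1280*4 on each half after the vertical split */
    dst[y * row_pitch + x] = src[y * row_pitch + x] * 0.5f;
}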
Could you please point me to documentation on the Mali memory bank architecture?
Under what conditions would bank conflicts occur within a work group?
I could not find any documentation regarding the bank conflicts.
Regards,
Banger.
There is an article on Anandtech explaining the Midgard architecture: ARM’s Mali Midgard Architecture Explored
But most of the time the issue is not bank conflicts, but cache utilisation: if you have 128 threads running in parallel and you've got a 256KB cache then you only have 2KB per thread.
Because Mali has a very long pipeline, between the execution of two instructions of a given thread several other threads will have executed an instruction. This means that even if two consecutive instructions of one thread access data from the same cache line, in practice the line is likely to have been evicted by the time the second instruction executes.
However, if two neighbouring threads access data from the same cache line, then because thread execution is interleaved it's more likely that the second thread will be able to read the data while it's still in the cache.
That's why it's important to experiment with various shapes of local workgroup size depending on your kernel's memory access patterns (especially if you have a lot of column accesses rather than row accesses).
Anthony