I have a fairly complex OpenCL implementation with 2D NDRange as follows:
Num of Work Groups - {10,7}
Work Group Size - {64,1}
With this I get a kernel time of 0.625 secs, but when I decrease the number of work groups to {10,4} the time goes up to 0.710 secs.
Below are the timings for the different numbers of work groups:
{10,7} - 0.625 secs
{10,4} - 0.710 secs
{10,3} - 0.759 secs
{10,2} - 0.826 secs
{10,1} - 0.185 secs (this is less as expected)
This seems strange to me, as the time taken should have been less with fewer work groups.
I am timing only the kernel execution, using OpenCL events, and the entire computation for {10,7} work groups involves only around 1G floating point operations.
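For reference, this is roughly how I launch and time the kernel (the variable names here are placeholders, not my actual code; the queue is created with CL_QUEUE_PROFILING_ENABLE):

size_t global[2] = {640, 7};    /* 10 x 7 work groups of {64,1} work items */
size_t local[2]  = {64, 1};
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t0 = 0, t1 = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
double seconds = (double)(t1 - t0) * 1e-9;   /* timestamps are in nanoseconds */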
With a peak throughput of around 200 GFLOPS for the T760 MP8, my target is to achieve 3 FPS for the entire operation, i.e. the whole algorithm should execute in 0.33 secs.
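(At 200 GFLOPS peak, ~1G floating point operations would in theory take only about 5 ms of pure arithmetic, so there should be plenty of headroom.)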
I am looking for deeper optimization options; I have tried vectorization and loop unrolling, but the performance is still only 0.625 secs.
Can anyone tell me the reason behind this? Or am I missing something?
Also, is there any Linux tool that can help me find the bottleneck in the code (number of scalar and vector registers used, etc.)?
Platform details:
MALI T760 MP8 on Exynos 7420 platform
Mobile - Galaxy S6
Thanks in Advance
Banger.
Hi ravibanger,
The reason the time increases is that you don't dispatch enough threads:
In theory: you can execute up to 256 threads at the same time on a core, and you've got 8 cores, so if you dispatch around 2000 threads they will all execute in parallel.
In practice: if the number of threads is close to the per-core limit, the driver will consider it not worth turning on all the cores and will instead serialise the jobs on a smaller number of cores.
So, I'm afraid the only solution is to dispatch more threads.
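To put rough numbers on your configuration: a {64,1} work group is 64 threads, so {10,7} work groups is 10 x 7 x 64 = 4480 threads, comfortably above the 256 x 8 = 2048 that the 8 cores can hold at once, whereas {10,2} is only 1280 threads and {10,1} only 640, which is few enough that the driver may decide not to power up every core.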
Regarding the static analysis: All Mali registers are 128 bits (There is no split between scalar / vector). We don't currently provide any static analysis tools.
However you can use DS-5 Streamline to see how busy the GPU is and how full the three pipes are (L/S, ALU, Texture).
Hope this helps,
Thanks Anthony for the quick response,
I have a couple of follow up questions.
I tried to launch the kernel with a work group size of {128,1}.
But clEnqueueNDRangeKernel is giving me a CL_OUT_OF_RESOURCES error.
I have not used any local memory, as the Mali OpenCL optimization guide suggests not to.
Hence I am guessing it may be because of the large register pressure inside the kernel caused by the increase in work group size.
The only solution is probably to split the kernel into multiple simpler kernels and then launch with a large work group size. Am I right?
My global memory footprint also doubles when I increase the work group size to 128, and that will remain the same even if I split the kernel into multiple kernels.
Can this high global memory footprint still cause a CL_OUT_OF_RESOURCES error when enqueuing my NDRange?
What are the possible reasons for CL_OUT_OF_RESOURCES in Mali OpenCL when calling clEnqueueNDRangeKernel?
Basically I want to rule out a high global memory footprint as the cause of the CL_OUT_OF_RESOURCES, which seems highly unlikely to me. Can you confirm this?
If so, I will go ahead and split my kernels and check my performance.
Regards
The optimisation guide is a bit out of date, sorry about that: quite often for 2D or 3D NDRanges, providing a local work group size to clEnqueueNDRangeKernel will have a significant impact on performance, because it changes the order in which threads are dispatched and therefore the way data gets cached.
For example, if you have a simple 2D bilateral filter that you split into a horizontal and a vertical pass, you should see a significant change in performance for the vertical pass depending on whether you use {128,1} / {1,128} / {4,32}, etc.
Unfortunately it's not always simple to predict, so you will have to experiment to find what the best size is.
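As a rough sketch of that experiment (names are placeholders; also remember that the global size must stay a multiple of whichever local size you pass):

size_t shapes[][2] = { {128, 1}, {1, 128}, {4, 32} };
for (int i = 0; i < 3; ++i) {
    size_t local[2] = { shapes[i][0], shapes[i][1] };
    cl_event evt;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                                        0, NULL, &evt);
    if (err != CL_SUCCESS)
        continue;    /* this shape is not valid for this kernel / global size */
    clWaitForEvents(1, &evt);
    /* read CL_PROFILING_COMMAND_START / _END as above and keep the fastest shape */
    clReleaseEvent(evt);
}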
Regarding the CL_OUT_OF_RESOURCES: if you don't specify a local work group size at compilation time, the compiler will not try to limit register usage, and you might end up only being able to use an LWS of 64.
However if you specify the LWS in your kernel like this:
__kernel __attribute__((reqd_work_group_size(128,1,1))) void my_kernel(__global uchar *foo)
Then the compiler will do whatever it takes for you to be able to use an LWS of 128.
Most of the time this means it will spill to global memory in order to relieve the register pressure, which obviously has a cost, but the benefit of having 128 threads running instead of 64 should outweigh it.
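Either way it is worth querying how large a work group the compiler has actually left you for this kernel (a minimal sketch, assuming you have the cl_kernel and cl_device_id at hand):

size_t max_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);
/* If this reports 64, an enqueue with a local size of 128 is expected to fail;
   CL_OUT_OF_RESOURCES is how the driver tells you it cannot fit the requested
   work group, e.g. because there are not enough registers. */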
Hi Anthony,
Setting the kernel attribute did work for me, and I was able to launch 128 and 256 work items per work group. Unfortunately, though, the performance degrades. This was expected, probably because my global memory data footprint per shader core exceeds the L1 cache size of 16KB.
So I will have to rework the algorithm itself to get further performance improvements.
Once again thanks for all the help.
Regards,
To my surprise, I got a performance improvement when I split the image vertically; earlier I was splitting it horizontally.
(Consecutive threads were reading image data at strides of 2560*4; now it is 1280*4.)
i.e. splitting the image in half makes the algorithm scale down linearly.
Could you please point me to documentation on the Mali memory bank architecture?
Under what conditions would bank conflicts occur within a work group?
I could not find any documentation regarding bank conflicts.
Regards, Banger.
There is an article on Anandtech explaining the Midgard architecture: ARM’s Mali Midgard Architecture Explored
But most of the time the issue is not bank conflicts, but cache utilisation: if you have 128 threads running in parallel and you've got a 256KB cache then you only have 2KB per thread.
Because Mali has a very long pipeline, several other threads will each have executed an instruction between two instructions of a given thread. This means that even if two consecutive instructions of one thread access data from the same cache line, in practice that line is likely to have been evicted by the time the second instruction executes.
However, if two neighbouring threads access data from the same cache line, then because the threads' instruction execution is interleaved, it's more likely that the second thread will be able to read the data while it's still in the cache.
That's why it's important to experiment with various shapes of local work group size depending on your kernel's memory access patterns (especially if you have a lot of column accesses rather than row accesses).
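To illustrate with a generic example (this is not your kernel, just a simple vertical pass over a row-major image):

__kernel void vertical_pass(__global const float *src,
                            __global float *dst,
                            int width, int height, int radius)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    float acc = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        int yy = clamp(y + dy, 0, height - 1);
        acc += src[yy * width + x];      /* column read: stride of width floats */
    }
    dst[y * width + x] = acc / (2 * radius + 1);
}

With a {128,1} local size the interleaved threads differ only in x, so their loads land next to each other in the same cache lines; with {1,128} they differ in y and any reuse comes from the overlapping column windows instead. Which pattern the cache prefers depends on the image width and the rest of the kernel, which is why measuring the different shapes is the practical approach.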
Anthony