
Strange OpenCL Performance Numbers - Mali T760 MP8

I have a fairly complex OpenCL implementation with 2D NDRange as follows:

Num of Work Groups - {10,7}

Work Group Size - {64,1}

With this configuration the kernel runs in 0.625 secs, but when I decrease the number of work groups to {10,4} the time degrades to 0.710 secs.

Below are the timings for the different work-group counts:

{10,7} - 0.625 secs

{10,4} - 0.710 secs

{10,3} - 0.759 secs

{10,2} - 0.826 secs

{10,1} - 0.185 secs  (this is lower, as expected)


This seems strange to me, as the time taken should have been less.

I am timing only the kernel execution, using OpenCL events. The entire computation for the {10,7} configuration takes only around 1G floating-point operations.
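For reference, my kernel-only timing looks roughly like this (a simplified sketch, error checks omitted; `queue`, `kernel`, `global` and `local` stand for my actual setup, and the queue is created with CL_QUEUE_PROFILING_ENABLE):

```c
/* Sketch: timing one kernel launch with OpenCL profiling events. */
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
double seconds = (double)(end - start) * 1e-9;  /* timestamps are in ns */
clReleaseEvent(evt);
```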


With a peak throughput of around 200 GFLOPS for the T760 MP8, my target is to achieve 3 FPS for the entire operation, i.e. the whole algorithm should execute in 0.33 secs.

I am looking for deeper optimization options. I have tried vectorization and loop unrolling, but the performance is still only 0.625 secs.
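By vectorization I mean using the OpenCL vector types, e.g. having each work-item process four floats via `float4`/`vload4` (a toy sketch, not my actual kernel):

```c
// Toy sketch of a vectorized kernel (not the real one): each work-item
// handles 4 consecutive floats instead of 1.
__kernel void scale4(__global const float *in, __global float *out, float k)
{
    size_t i = get_global_id(0);
    float4 v = vload4(i, in);    // reads in[4*i .. 4*i+3]
    vstore4(v * k, i, out);      // writes out[4*i .. 4*i+3]
}
```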


Can anyone tell me the reason behind this? Or am I missing something?

Also, is there a Linux tool that can help me find the bottleneck in the code (number of scalar and vector registers used, etc.)?

Platform details:

MALI T760 MP8 on Exynos 7420 platform

Mobile - Galaxy S6

Thanks in Advance

Banger.

  • Hi ravibanger,

    There is an article on Anandtech explaining the Midgard architecture: ARM’s Mali Midgard Architecture Explored

    But most of the time the issue is not bank conflicts, but cache utilisation: if you have 128 threads running in parallel and you've got a 256KB cache then you only have 2KB per thread.

    Because Mali has a very long pipeline, several other threads will each execute an instruction between two consecutive instructions of any given thread. So even if two consecutive instructions of one thread access data from the same cache line, in practice that line is likely to have been evicted by the time the second instruction executes.

    However, if two neighbouring threads access data from the same cache line, then because the threads' instruction execution is interleaved, it's much more likely that the second thread will be able to read the data while it's still in the cache.

    That's why it's important to experiment with various shapes of local workgroup size depending on your kernel's memory access patterns (especially if you have a lot of column accesses instead of row accesses).
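    As a toy illustration (assuming a row-major buffer of width `w` / height `h`, nothing to do with your actual kernel): with a {64,1} workgroup, neighbouring work-items in the first kernel hit the same cache line, while in the second each one touches a different line.

    ```c
    // Cache-friendly with {64,1}: work-items with consecutive x read
    // consecutive addresses, i.e. the same cache line.
    __kernel void row_access(__global const float *in, __global float *out, int w)
    {
        int x = get_global_id(0), y = get_global_id(1);
        out[y * w + x] = in[y * w + x];
    }

    // Cache-hostile: neighbouring work-items are h floats apart, so each
    // access lands on a different cache line.
    __kernel void col_access(__global const float *in, __global float *out, int h)
    {
        int x = get_global_id(0), y = get_global_id(1);
        out[x * h + y] = in[x * h + y];
    }
    ```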

    Hope this helps,

    Anthony