Hi,
I am trying to implement an OpenCL kernel on a Mali-G76 with DDK r16.
I find that if I define and use an array such as "half A[16];", performance is poor.
But if I use "half16 A;", the performance is very good.
I wonder whether the array is being mapped to global memory, which would explain the poor performance?
However, I need to use an array instead of a vector because the algorithm has to index the ith element "A[i]" in a for loop.
I don't think a vector can be indexed that way, e.g. "A.si".
Can anyone help me?
Thank you very much in advance!
A lot depends on what your code looks like and how the array is being used at runtime. Some arrays can be promoted to registers, some cannot - it all depends on what the kernel code is doing with them.
Can you share a kernel?
Cheers, Pete
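To illustrate the point about register promotion, here is a minimal sketch of the two access patterns. It is written in plain C with float standing in for half so that it compiles outside OpenCL; the function names and the size N are hypothetical, and whether a given Mali compiler version actually promotes the first array is an assumption, not something confirmed in this thread.

```c
#define N 16

/* Fixed trip count: once the loops are unrolled, every index into a[]
 * is a compile-time constant, so a compiler is free to keep the 16
 * elements in registers (assumption: the compiler chooses to do so). */
float sum_static(const float *in)
{
    float a[N];
    for (int i = 0; i < N; ++i)
        a[i] = in[i] * 2.0f;

    float acc = 0.0f;
    for (int i = 0; i < N; ++i)
        acc += a[i];
    return acc;
}

/* Index known only at runtime: the array generally has to stay
 * addressable in memory, which is the slower case described above. */
float pick_dynamic(const float *in, int idx)
{
    float a[N];
    for (int i = 0; i < N; ++i)
        a[i] = in[i] * 2.0f;

    return a[idx & (N - 1)]; /* mask keeps the index inside 0..15 */
}
```

The same reasoning applies to vector types: components such as "A.s0" are named at compile time, so a loop over them has to be written out (or unrolled) by hand rather than indexed with a runtime variable.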
Hi Peter, thank you very much! It really is as you say: it depends on my code!
I have another question:
Do you know whether there is any dedicated hardware for local memory on the G76?
I want to use local memory to optimize a GEMM; do you think that is a good idea?
Mali GPUs do not have a dedicated local memory for compute shaders.
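Without dedicated local memory, staging GEMM tiles through local memory is unlikely to pay off by itself, so one commonly used alternative is to block the computation in registers instead. The sketch below shows that idea in plain C: each (i0, j0) iteration corresponds to what one work-item would compute in a kernel. The function name, the 4x4 tile size, and the row-major layout are assumptions for illustration, and the best tile size on a G76 would need to be measured.

```c
#define TILE 4

/* Computes C = A * B for row-major matrices: A is MxK, B is KxN, C is MxN.
 * Assumes M and N are multiples of TILE (hypothetical simplification).
 * Each (i0, j0) tile keeps a TILE x TILE accumulator in local variables,
 * so every loaded element of A and B is reused TILE times. */
void gemm_reg_tiled(int M, int N, int K,
                    const float *A, const float *B, float *C)
{
    for (int i0 = 0; i0 < M; i0 += TILE) {
        for (int j0 = 0; j0 < N; j0 += TILE) {
            float acc[TILE][TILE] = {{0.0f}};

            for (int k = 0; k < K; ++k) {
                float a[TILE], b[TILE];

                /* Load one column slice of A and one row slice of B. */
                for (int i = 0; i < TILE; ++i)
                    a[i] = A[(i0 + i) * K + k];
                for (int j = 0; j < TILE; ++j)
                    b[j] = B[k * N + j0 + j];

                /* Rank-1 update of the accumulator tile. */
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j)
                        acc[i][j] += a[i] * b[j];
            }

            /* Write the finished tile back once. */
            for (int i = 0; i < TILE; ++i)
                for (int j = 0; j < TILE; ++j)
                    C[(i0 + i) * N + j0 + j] = acc[i][j];
        }
    }
}
```

All inner loops have fixed trip counts, so the small arrays here fall into the "can be promoted to registers" case discussed earlier in the thread.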