What is the most efficient memory access pattern when running an OpenCL program on a Mali GPU? In what order should the different cores and threads access memory, and is there any documentation that explains this?
For example, the Mali G710 GPU has 10 cores, with a maximum thread count of 2048 or 1024 per core. When I set the local work size in OpenCL to {16,8}, each core uses only 128 threads. When I change the local work size to {32,8}, each core uses 256 threads, which should give higher throughput, but the actual results are the opposite. Can anyone explain this?
Old thread, but to answer this one ...
The workgroup size does not determine thread occupancy. A shader core can run multiple workgroups at the same time, so in both cases you should be able to fill all of the thread slots that the register usage of your shader program allows.
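To make the arithmetic concrete, here is a rough sketch using the numbers from the question (2048 or 1024 thread slots per core). The function names are made up for illustration, and the model assumes the core simply packs whole workgroups into its thread slots, ignoring any other resource limits:

```python
def resident_workgroups(thread_slots_per_core, local_size):
    """Whole workgroups that fit in one core's thread slots (simplified model)."""
    threads_per_group = local_size[0] * local_size[1]
    return thread_slots_per_core // threads_per_group


def occupied_threads(thread_slots_per_core, local_size):
    """Thread slots in use when the core is packed with whole workgroups."""
    threads_per_group = local_size[0] * local_size[1]
    return resident_workgroups(thread_slots_per_core, local_size) * threads_per_group


# Thread slot counts are the two figures quoted in the question.
for slots in (2048, 1024):
    for local_size in ((16, 8), (32, 8)):
        groups = resident_workgroups(slots, local_size)
        threads = occupied_threads(slots, local_size)
        print(f"slots={slots} local={local_size}: {groups} workgroups, {threads} threads")
```

Under this model both {16,8} and {32,8} fill every slot; only the number of resident workgroups changes (16 vs 8 at 2048 slots, 8 vs 4 at 1024 slots).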
Differences in performance will be caused by changes in memory access pattern and temporal locality of data accesses, but exactly what this looks like depends on what your compute kernel is doing.
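One way to get a feel for this is to count how many cache lines a single workgroup touches for a given shape. The sketch below is only an illustration: it assumes each work item reads one 4-byte element from a row-major 2D buffer, a 64-byte cache line, and a row pitch of 1024 elements. None of these values come from your kernel, so substitute your own:

```python
def cache_lines_touched(local_size, row_pitch_elems, elem_bytes=4,
                        line_bytes=64, origin=(0, 0)):
    """Count distinct cache lines read by one workgroup.

    Assumes work item (lx, ly) reads element (origin_x + lx, origin_y + ly)
    of a row-major buffer. All parameters are illustrative assumptions.
    """
    ox, oy = origin
    lines = set()
    for ly in range(local_size[1]):
        for lx in range(local_size[0]):
            byte_addr = ((oy + ly) * row_pitch_elems + (ox + lx)) * elem_bytes
            lines.add(byte_addr // line_bytes)
    return len(lines)


# Compare shapes; (16, 8), (8, 16) and (128, 1) all have 128 work items.
for shape in ((16, 8), (32, 8), (8, 16), (128, 1)):
    n_threads = shape[0] * shape[1]
    n_lines = cache_lines_touched(shape, row_pitch_elems=1024)
    print(f"local={shape}: {n_threads} threads touch {n_lines} cache lines")
```

With these assumptions {16,8} and {128,1} touch 8 lines, while {8,16} touches 16 lines for the same 128 threads, and {32,8} touches 16 lines for 256 threads. A real kernel with strided or data-dependent accesses will behave differently, so treat this as a starting point for reasoning rather than a prediction.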
HTH, Pete