Memory Access Optimization for OpenCL Programs Running on Mali GPU

What is the most efficient memory access pattern for an OpenCL program running on a Mali GPU? In what order should the different cores and threads access memory, and is there any documentation that explains this?

For example, the Mali G710 GPU has 10 cores, each with a maximum thread count of 2048 or 1024. When I set the local work size in OpenCL to {16,8}, each work-group has 128 work-items, so as I understand it each core uses only 128 threads. When I change the local work size to {32,8}, each work-group has 256 work-items, so each core should use 256 threads and throughput should be higher. The measured results are the opposite: {32,8} is slower. Can anyone explain this?