Hello, we are developing a product based on the Mali T764 (RK3288) with OpenCL. Our kernel uses about 1KB of local memory per workgroup. I was wondering where this local memory is allocated, and whether we could take advantage of the L2 cache (1MB on RK3288) as local memory, which might greatly speed up our program. Many thanks!
The GPU L2 in the RK3288 isn't 1MB; it's only 256KB. The 1MB cache is the CPU L2 cache, which has nothing to do with Mali at all ...
Our problem now is the frequent data transfer
Based on what you are saying, you are reading and writing the same 1KB of memory multiple times from the same work item. That should be fine and should fit entirely inside the L1, let alone the L2, so memory bandwidth _may_ not be your problem, although a lot depends on how the data is laid out in memory. How do you know that L2-to-main-memory bandwidth is your problem?
I'd suggest looking at some of the video tutorials here, as they go into a lot of detail about how memory accesses can be optimized in compute kernels, and explain how to profile using the performance counters.
GPU Compute, OpenCL and RenderScript Tutorials - Mali Developer Center
HTH, Pete
EDIT: Fixed cache size, apparently 256KB.
Hi Peter, could you please tell me the maximum number of work items that can run at the same time on the Mali T764 (RK3288), and the size of the L1 cache in that GPU?
Many thanks!
Tan
Hi Tan,
The maximum occupancy on a Midgard GPU is 256 threads per shader core, so 1024 on a T760 MP4. The "Mali-T760 - ARM" product page doesn't say anything about the L1 cache, so it might not be public information. Pete will know.
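For anyone who wants to redo that arithmetic for a different core count, here is a minimal C sketch. The function name `max_resident_threads` is made up for illustration and is not part of any Mali or OpenCL API; it only multiplies the per-core limit by the number of shader cores, and it assumes every core can actually reach its full 256-thread occupancy.

```c
/* Hypothetical helper (not an OpenCL or Mali API call).
 * Upper bound on threads resident across the whole GPU:
 * the per-core thread limit times the number of shader cores.
 * Assumes every core can reach its full occupancy. */
unsigned max_resident_threads(unsigned threads_per_core, unsigned shader_cores)
{
    return threads_per_core * shader_cores;
}
```

Calling `max_resident_threads(256, 4)` gives 1024, which is the T760 MP4 figure quoted above.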
Hth,
Chris
Thanks Chris, Peter posted an update in his blog:
T760 has two 16KB L1 data caches per shader core; one for texture access and one for generic memory access.
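To put those sizes next to the 1KB buffer from the original question, here is a rough capacity estimate in C. It is only a sketch under assumptions the thread does not confirm: that the 1KB buffers are served through the generic-access L1, and that nothing else competes for the cache. The function name `buffers_fitting_in_cache` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: how many whole buffers of buffer_bytes fit in a
 * cache of cache_bytes. Ignores associativity, line size and any other
 * data sharing the cache, so treat the result as an optimistic bound. */
size_t buffers_fitting_in_cache(size_t cache_bytes, size_t buffer_bytes)
{
    if (buffer_bytes == 0) {
        return 0; /* avoid division by zero; no meaningful answer */
    }
    return cache_bytes / buffer_bytes;
}
```

With the numbers from this thread, `buffers_fitting_in_cache(16 * 1024, 1024)` gives 16 for one 16KB L1, and `buffers_fitting_in_cache(256 * 1024, 1024)` gives 256 for the 256KB GPU L2. Real behaviour depends on the access pattern, so the performance counters remain the way to confirm it.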