Hello, we are developing a product based on the Mali-T764 (RK3288) with OpenCL. In our kernel, we use about 1 KB of local memory per work group. I was wondering where this local memory is allocated, and whether it is possible for us to take advantage of the L2 cache (1 MB on the RK3288) as local memory, which might greatly speed up our program. Many thanks!
Hi Peter, thanks for your reply! My further question is whether we can just keep the data (1 KB per work group, <1 MB in total) cached in L2 without transferring it to system RAM. For our program the L2 is big enough to hold all the data that needs to be processed, and the biggest delay at the moment seems to be the transfers between the cache and the RAM.
Best!
Tan
Hi Tan,
Note that the L1 and L2 I mention above are the GPU L1 and L2, not the CPU L1 and L2. You still need to push data out of the CPU cache and into main memory so that the GPU can see it.
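If it helps, one common way to let the driver handle that cache maintenance on Mali is to allocate the buffer with CL_MEM_ALLOC_HOST_PTR and use map/unmap rather than explicit copies. A minimal host-side sketch (the buffer size and variable names are placeholders, and an existing ctx, queue and err are assumed):

/* Sketch only: ctx, queue and err are assumed to exist already.
 * CL_MEM_ALLOC_HOST_PTR asks the driver to allocate memory that both
 * the CPU and the Mali GPU can access; mapping and unmapping tell the
 * driver when to do the CPU cache maintenance, with no extra copy. */
size_t size = 500 * 1024;                        /* placeholder: ~500 KB */
cl_mem buf = clCreateBuffer(ctx,
                            CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            size, NULL, &err);

/* Map on the CPU side and fill in the input data. */
void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                               0, size, 0, NULL, NULL, &err);
/* ... write the input into ptr ... */

/* Unmapping is the point at which the driver flushes the CPU caches
 * so the GPU sees the data. */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);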
HTH, Pete
Hi Peter,
Our kernel processes about 500 KB of data (1 KB per work group). Every work item loads data from global memory, does its calculation and writes the result to local memory; it then loads data calculated by other work items in the same work group from local memory, and repeats this calculate/write/read cycle 10 times before storing the final result to global memory. In this process only the initial data and the final result need to be exchanged with the CPU; everything else can be done by the GPU itself. Our problem now is that the frequent data transfers (10 x 500 KB) between the L1/L2 caches and main memory cause serious delay, which seems unnecessary. If we could just keep the data in L2 without transferring it to main memory (on the RK3288 the L2 is 1 MB, big enough for us), our kernel would be much more efficient.
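To make the pattern clearer, here is a rough sketch of the kind of kernel I mean (the names, the 256-item work-group size and the arithmetic are simplified placeholders, not our real code):

/* Simplified sketch of the access pattern described above. */
__kernel void iterate(__global const float *in, __global float *out)
{
    __local float scratch[256];              /* ~1 KB of local memory   */

    int lid = get_local_id(0);
    int gid = get_global_id(0);

    float value = in[gid];                   /* initial load from global */

    for (int i = 0; i < 10; ++i) {
        scratch[lid] = value;                /* push result to local mem */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* load a value calculated by another work item in the group */
        value += scratch[(lid + 1) % 256];
        barrier(CLK_LOCAL_MEM_FENCE);        /* before the next write    */
    }

    out[gid] = value;                        /* final store to global    */
}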
The GPU L2 in the RK3288 isn't 1 MB; it's only 256 KB. The 1 MB cache is the CPU L2 cache, which has nothing to do with the Mali at all ...
Our problem now is that the frequent data transfers ...
Based on what you are saying, you are reading and writing the same 1 KB of memory multiple times from the same work item. That should be fine and should fit entirely inside the L1, let alone the L2, so memory bandwidth _may_ not be your problem, although a lot depends on how that data is laid out in memory. How do you know that L2-to-main-memory bandwidth is your problem?
I'd suggest looking at some of the video tutorials here, as they go into a lot of detail about how memory accesses can be optimized in compute kernels, and explain how to profile using the performance counters.
GPU Compute, OpenCL and RenderScript Tutorials - Mali Developer Center
EDIT: Fixed cache size, apparently 256 KB.
Hi Peter, could you please tell me the maximum number of work items that can run at the same time on the Mali-T764 (RK3288), and the size of the L1 cache in that GPU?
Many thanks!
The maximum occupancy on a Midgard GPU is 256 threads per shader core, so 1024 on a Mali-T760 MP4. The Mali-T760 - ARM page doesn't say anything about the L1 cache, so it might not be public information. Pete will know.
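If you want to see what the driver itself reports, you can query the limits at run time; a small sketch (assuming you already have a cl_device_id and a built cl_kernel):

/* Sketch: query the work-group limits the driver reports.
 * Assumes an existing cl_device_id dev and a built cl_kernel kernel. */
#include <stdio.h>
#include <CL/cl.h>

void print_wg_limits(cl_device_id dev, cl_kernel kernel)
{
    size_t dev_max = 0, krn_max = 0;
    cl_uint units = 0;

    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max), &dev_max, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(units), &units, NULL);
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(krn_max), &krn_max, NULL);

    printf("device max work-group size:          %zu\n", dev_max);
    printf("compute units (shader cores):        %u\n",  units);
    printf("max work-group size for this kernel: %zu\n", krn_max);
}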
Hth,
Chris
Thanks Chris, Peter posted an update in his blog:
The Mali-T760 has two 16 KB L1 data caches per shader core: one for texture access and one for generic memory access.