About local memory in OpenCL

tan414 over 9 years ago

Hello, we are developing a product based on the Mali-T764 (RK3288) with OpenCL. Our kernel uses about 1KB of local memory per work group. I was wondering where this local memory is allocated, and whether we could use the L2 cache (1MB on the RK3288) as local memory, which might greatly speed up our program. Many thanks!

0 tan414 over 9 years ago in reply to Peter Harris

Hi Peter,
Our kernel processes about 500KB of data (1KB per work group). Every work item loads data from global memory, does its calculation and writes the result to local memory, then loads data from local memory that was calculated by work items in the same work group. This calculate, write and load cycle repeats 10 times, and then the final result is stored to global memory. In this process only the initial data and the final result need to be exchanged with the CPU; everything else can be done by the GPU itself. Our problem now is the frequent data transfer (10 x 500KB) between the L1 cache, the L2 cache and main memory, which causes serious delay and seems unnecessary. If we could keep the data in L2 without transferring it to main memory (in the RK3288, L2 is 1MB, big enough for us), our kernel would be much more efficient.
Best!
Tan
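
For readers following along, the access pattern described in this post can be sketched in plain C as an emulation of a single work group. This is only an illustration under assumptions: the buffer size of 256 floats (1KB), the name run_work_group and the neighbour-averaging step are all hypothetical stand-ins, since the actual calculation is not given in the thread. The two inner loops mark the points where a real OpenCL kernel would need a barrier(CLK_LOCAL_MEM_FENCE) between writing and reading local memory.

```c
#include <stddef.h>

#define WG_ITEMS   256  /* assumed: 256 floats = 1KB of local memory per work group */
#define ITERATIONS 10   /* from the post: the local-memory round trip repeats 10 times */

/* Emulates one work group on the CPU. `in` and `out` stand in for global
 * memory, `local_buf` for the kernel's __local array. */
void run_work_group(const float *in, float *out)
{
    float priv[WG_ITEMS];      /* one private value per emulated work item */
    float local_buf[WG_ITEMS]; /* stands in for the 1KB of local memory */

    /* Each work item loads its input from global memory once. */
    for (size_t i = 0; i < WG_ITEMS; ++i)
        priv[i] = in[i];

    for (int it = 0; it < ITERATIONS; ++it) {
        /* Phase 1: every work item publishes its value to local memory. */
        for (size_t i = 0; i < WG_ITEMS; ++i)
            local_buf[i] = priv[i];
        /* A real kernel needs a work-group barrier here. */

        /* Phase 2: every work item reads a value written by a neighbour.
         * The averaging is placeholder math, not the poster's calculation. */
        for (size_t i = 0; i < WG_ITEMS; ++i)
            priv[i] = 0.5f * (local_buf[i] + local_buf[(i + 1) % WG_ITEMS]);
        /* ...and another barrier before local_buf is overwritten. */
    }

    /* Only the final result goes back to global memory. */
    for (size_t i = 0; i < WG_ITEMS; ++i)
        out[i] = priv[i];
}
```

In this shape, global memory is touched only at the first and last loop; the ten intermediate passes work on the same 1KB buffer.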

0 Peter Harris over 9 years ago in reply to tan414

The GPU L2 in the RK3288 isn't 1MB; it's only 256KB. The 1MB cache is the CPU L2 cache, which has nothing to do with Mali at all ...
"Our problem now is the frequently data transfer"
Based on what you are saying, you are reading and writing the same 1KB of memory multiple times from the same work item. That should be fine and should fit entirely inside the L1, let alone the L2, so memory bandwidth _may_ not be your problem, although a lot depends on how that is laid out in memory. How do you know this L2 to main memory bandwidth is your problem?
I'd suggest looking at some of the video tutorials here, as they go into a lot of detail about how memory accesses can be optimized in compute kernels, and explain how to profile using the performance counters.
GPU Compute, OpenCL and RenderScript Tutorials - Mali Developer Center
HTH,
Pete
EDIT: Fixed cache size, apparently 256KB.
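
The sizes quoted in this thread can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope aid: the 256KB GPU L2 comes from this reply, the 16KB L1 data cache per shader core is the figure reported later in the thread, and the helper names are hypothetical. It says nothing about how the hardware actually places or evicts data.

```c
#include <stddef.h>

enum {
    KB             = 1024,
    L1_DATA_BYTES  = 16 * KB,   /* per shader core, as reported later in the thread */
    GPU_L2_BYTES   = 256 * KB,  /* GPU L2 on the RK3288, per this reply */
    WG_LOCAL_BYTES = 1 * KB,    /* local memory per work group, per the question */
    TOTAL_BYTES    = 500 * KB   /* total data set, per the question */
};

/* How many buffers of buffer_bytes fit into cache_bytes at once. */
size_t buffers_that_fit(size_t cache_bytes, size_t buffer_bytes)
{
    return buffer_bytes ? cache_bytes / buffer_bytes : 0;
}

/* Returns 1 when the working set fits into the cache, 0 otherwise. */
int fits_in_cache(size_t working_set_bytes, size_t cache_bytes)
{
    return working_set_bytes <= cache_bytes;
}
```

With these numbers, a single 1KB work-group buffer fits in a 16KB L1 with room for fifteen more, while the full 500KB data set does not fit in the 256KB GPU L2 at once.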

0 tan414 over 9 years ago in reply to Peter Harris

Hi Peter, could you please tell me the maximum number of work items that can run at the same time on the Mali-T764 (RK3288), and the size of the L1 cache in that GPU?
Many thanks!
Tan

0 Chris Varnsverry over 9 years ago in reply to tan414

Hi Tan,
The maximum occupancy on a Midgard GPU is 256 threads per shader core, so 1024 on a T760 MP4. The "Mali-T760 - ARM" page doesn't say anything about the L1 cache, so it might not be public information. Pete will know.
HTH,
Chris
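
The occupancy figure above is simple multiplication, sketched here in C. The helper names are hypothetical, and the result is an upper bound only: the number of threads actually resident can be lower, for example when a kernel uses many registers.

```c
enum {
    THREADS_PER_CORE = 256  /* Midgard maximum, per the reply above */
};

/* Upper bound on threads (work items) resident at once. */
unsigned max_resident_threads(unsigned shader_cores)
{
    return shader_cores * THREADS_PER_CORE;
}

/* Upper bound on work groups resident at once for a given work-group size. */
unsigned max_resident_work_groups(unsigned shader_cores, unsigned wg_size)
{
    if (wg_size == 0)
        return 0;
    /* A work group runs on a single shader core, so divide per core first. */
    return shader_cores * (THREADS_PER_CORE / wg_size);
}
```

For the four-core T760 MP4 this gives 4 x 256 = 1024 threads, matching the number quoted in the reply.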

+1 tan414 over 9 years ago in reply to Chris Varnsverry

Thanks Chris, Peter has updated his blog:
T760 has two 16KB L1 data caches per shader core; one for texture access and one for generic memory access.
HTH,
Tan