
About local memory in OpenCL

Hello, we are developing a product based on the Mali-T764 (RK3288) with OpenCL. Our kernel uses about 1 KB of local memory per work group. Where is this local memory allocated, and is it possible to use the L2 cache (1 MB on the RK3288) as local memory? That might greatly speed up our program. Many thanks!

  • Hi Peter,

    Our kernel computes about 500 KB of data (1 KB per work group). Each work item loads data from global memory, performs a calculation and writes the result to local memory, then loads results written by other work items in the same work group. This calculate, write and load cycle repeats 10 times, after which the final result is stored to global memory. Only the initial data and the final result need to be exchanged with the CPU; everything else can be done by the GPU itself. Our problem is that the frequent data transfer (10 x 500 KB) between the L1 and L2 caches and main memory causes serious delay and seems unnecessary. If we could keep the data in L2 without transferring it to main memory (on the RK3288 the L2 is 1 MB, big enough for us), our kernel would be much more efficient.
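
    To make the access pattern concrete, here is a minimal sketch that emulates one work group on the CPU in plain C. It is not our actual kernel: the names `run_work_group` and `update`, the one-float-per-work-item layout, and the neighbour-averaging step are all assumptions for illustration, and the two scratch arrays stand in for local memory with a barrier between passes. The counters show that, by design, only the first load and the last store need to touch global memory.

```c
#include <stddef.h>

#define GROUP_BYTES 1024                        /* 1 KB of local memory per work group */
#define ITEMS (GROUP_BYTES / sizeof(float))     /* assumption: one float per work item */
#define PASSES 10                               /* the cycle is repeated 10 times */

/* Counters for emulated global-memory accesses. */
static size_t global_reads, global_writes;

/* Hypothetical placeholder for the real calculation:
 * combine this item's value with one produced by a neighbour. */
static float update(const float *local, size_t i) {
    return 0.5f * (local[i] + local[(i + 1) % ITEMS]);
}

/* Emulates one work group; each loop over i stands in for all work items
 * running in parallel, and swapping buffers stands in for a barrier. */
void run_work_group(const float *global_in, float *global_out) {
    float local_a[ITEMS], local_b[ITEMS];       /* stand-ins for local memory */
    float *cur = local_a, *next = local_b;

    /* Initial load from global memory. */
    for (size_t i = 0; i < ITEMS; ++i) {
        cur[i] = global_in[i];
        ++global_reads;
    }

    /* All intermediate passes read and write the scratch buffers only. */
    for (int pass = 0; pass < PASSES; ++pass) {
        for (size_t i = 0; i < ITEMS; ++i)
            next[i] = update(cur, i);
        float *tmp = cur; cur = next; next = tmp;
    }

    /* Final store to global memory. */
    for (size_t i = 0; i < ITEMS; ++i) {
        global_out[i] = cur[i];
        ++global_writes;
    }
}
```

    In this emulation each work group reads and writes global memory once per element, so the 10 intermediate passes add no global traffic. Whether the GPU behaves the same way depends on where its local memory actually lives, which is exactly what we are asking.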

    Best!

    Tan
