Hi,
We are developing a program based on the RK3288 (Mali T764) with OpenCL, and we have run into a very strange problem.
Our kernel has 4 workgroups of {2,64} work-items (128 x 4 in total). The inputs are two 512x768 global float arrays and one 384-element global float array; the output is one 512x768 array. Inside the kernel we declare one 768-element local float4 array and one 384-element local float array. At the beginning, two lines of the 512x768 global float array are loaded into the 768-element local float4 array, and the whole 384-element global float array is loaded into the 384-element local array. We then do some calculations and save the output. Up to this point, everything works fine.
We need to synchronize the local memory for further work, so we added a barrier(CLK_LOCAL_MEM_FENCE); however, this is where the trouble starts. clEnqueueNDRangeKernel() returns error -5, which seems to mean CL_OUT_OF_RESOURCES. We ran further tests to narrow down the error:
1. Add the barrier but remove the calculation code. The kernel works fine.
2. Add the barrier and keep the calculation, but remove the code that loads the two lines from the global 512x768 arrays (so the data in local memory is 0). The kernel also works fine.
We are totally confused about whether the problem is in our code, in the driver, or in OpenCL itself. Could anyone help us figure it out? Many thanks!
The code is here:
Original code with the barrier (if the barrier in the middle is removed, it works fine):
const char* f1_ap2xy[] = {
"__kernel void f1_ap2xy(__global float *in_amp, __global float *in_phase, __global float *in_lut, __global float *out)\n"
"{\n"
"    local float4 temp[768];\n"
"    local float lut[384];\n"
"    uint4 gid;\n"
"    float4 r0_x, r1_x, r0_y, r1_y, t;\n"
"    gid.x = (get_global_id(0) >> 1) * 98304 + ((get_global_id(0) & 0x0001) * 768);\n"
"    gid.y = (get_global_id(1) << 2) + gid.x;\n"
"    gid.z = get_local_id(1);\n"
"    gid.w = get_local_id(0) * 192 + gid.z;\n"
"    __global float4 *trans = (__global float4 *)(in_amp + gid.y);\n"
// read data from global memory
"    r0_x = trans[0];\n"
"    r1_x = trans[64];\n"
"    t = trans[128];\n"
"    trans = (__global float4 *)(in_phase + gid.y);\n"
"    r0_y = trans[0];\n"
"    r1_y = trans[64];\n"
"    temp[gid.w + 64] = trans[128];\n"
"    trans = (__global float4 *)(out + gid.y);\n"
"    lut[(get_local_id(0) << 6) + gid.z] = in_lut[(get_local_id(0) << 6) + gid.z];\n"
"    lut[(get_local_id(0) << 6) + gid.z + 128] = in_lut[(get_local_id(0) << 6) + gid.z + 128];\n"
"    lut[(get_local_id(0) << 6) + gid.z + 256] = in_lut[(get_local_id(0) << 6) + gid.z + 256];\n"
"    barrier(CLK_LOCAL_MEM_FENCE);\n"
// calculate
"    temp[gid.w] = r0_x + r1_x + t;\n"
"    temp[gid.w + 384] = r0_y + r1_y + temp[gid.w + 64];\n"
"    temp[gid.w + 128] = r0_x - (0.5f * r1_x - (0.8660254f * r1_y)) - ((temp[gid.w + 64] * 0.8660254f) + (t * 0.5f));\n"
"    temp[gid.w + 512] = r0_y - (0.5f * r1_y + (0.8660254f * r1_x)) + ((0.8660254f * t) - (0.5f * temp[gid.w + 64]));\n"
"    temp[gid.w + 448] = r0_y - (0.5f * r1_y - (0.8660254f * r1_x)) - ((0.8660254f * t) + (0.5f * temp[gid.w + 64]));\n"
"    temp[gid.w + 64] = r0_x - (0.5f * r1_x + (0.8660254f * r1_y)) + ((0.8660254f * temp[gid.w + 64]) - (0.5f * t));\n"
// output
"    trans[0] = temp[gid.w + 384];\n"
"    trans[64] = temp[gid.w + 512];\n"
"    trans[128] = temp[gid.w + 448];\n"
"}\n"
};
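For reference, this is roughly how we enqueue the kernel (a simplified sketch; queue and kernel stand for the objects created earlier in our host code, and the sizes follow from the 4 workgroups of {2,64} work-items described above):

#include <CL/cl.h>
#include <stdio.h>

/* Enqueue f1_ap2xy: 4 work-groups of {2,64} work-items -> global {8,64}, local {2,64}. */
cl_int run_f1_ap2xy(cl_command_queue queue, cl_kernel kernel)
{
    size_t global[2] = { 8, 64 };
    size_t local[2]  = { 2, 64 };

    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                        global, local, 0, NULL, NULL);
    if (err != CL_SUCCESS)   /* -5 is CL_OUT_OF_RESOURCES */
        printf("clEnqueueNDRangeKernel failed: %d\n", err);
    return err;
}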
According to the CL documentation for clEnqueueNDRangeKernel:
CL_OUT_OF_RESOURCES if there is a failure to queue the execution instance of kernel on the command-queue because of insufficient resources needed to execute the kernel. For example, the explicitly specified local_work_size causes a failure to execute the kernel because of insufficient resources such as registers or local memory.
When you introduce the barrier, the number of registers needed to execute your kernel increases (it prevents the compiler from re-ordering some operations, therefore forcing it to store more temporary values).
If you tell the compiler from the start what group size you're planning on using, it might be able to deal with it.
Could you please try to recompile your kernel specifying the work-group size?
__attribute__((reqd_work_group_size(2, 64, 1))) __kernel void f1_ap2xy(__global float *in_amp, __global float *in_phase, __global float *in_lut, __global float *out)
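You can also ask the compiler what it reports for the built kernel, which helps confirm whether the requested work-group size and the local memory usage fit (a generic OpenCL query, nothing Mali-specific; kernel and device here are the objects you already have):

#include <CL/cl.h>
#include <stdio.h>

/* Print the resource usage the compiler reports for a built kernel. */
void print_kernel_limits(cl_kernel kernel, cl_device_id device)
{
    size_t   max_wg_size = 0;
    cl_ulong local_mem   = 0;
    cl_ulong private_mem = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg_size), &max_wg_size, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);

    printf("max work-group size : %zu\n", max_wg_size);
    printf("local memory used   : %lu bytes\n", (unsigned long)local_mem);
    printf("private memory used : %lu bytes\n", (unsigned long)private_mem);
}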
Let me know how it goes.
Thanks,
Anthony
Hi Anthony, thanks very much for your help. It works now.
I have further questions.
1. Mali T764 has 4 cores, so we set 4 workgroups and want to fit one workgroup into one core. Is that correct?
2. Since every core has 16KB of L1 cache, we allocate one 768-element float4 array and one 384-element float array in local memory (13.5KB in total), which might be kept in the L1 cache during the whole process, so operations on this local memory should be very fast. Is that right?
3. We try to use fewer than 8 registers in order to achieve 128 concurrent threads. In the kernel above we have 1 uint4, 5 float4s, and 1 __global pointer, so that is fewer than 8 registers, and 128 concurrent threads can be achieved. Am I right?
I am looking forward to your reply!
Best regards!
Tan
1. In theory yes, but in practice no: if you start only a few threads then, in order to save power, the driver will not start all the cores.
2. It's not that simple; you also have your LUT and the "out" array, so chances are it doesn't all fit in the cache. You can try DS-5 Streamline (Streamline Full Features Overview | ARM DS-5 Development Studio) and look at the ratio "$MaliLoadStorePipeLSInstructions" / "$MaliLoadStorePipeLSInstructionIssues" (i.e. the number of instructions completing vs the number of instructions issued).
3. The number of registers used is not defined by the number of private variables you have in your kernel: every time you perform an arithmetic operation, the two operands have to be in registers; the same is true for memory loads, and because Mali has a tri-pipe these operations are executed in parallel. However, specifying the work_group_size inside your kernel source forces the compiler to use only 8 registers per work-item (it will spill registers if everything doesn't fit in 8 registers, which might be an issue if your kernel is load/store bound, but it will probably still be faster than running fewer threads).
Thanks a lot. BTW, is there any way to force the GPU to start all the cores, since power consumption is not a major concern in our application?
Not really, but you can try to start more threads even if they don't do anything. However, remember there is no guarantee about the order in which they will get spawned.
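For example (just a sketch; the kernel name and the NUM_REAL_GROUPS macro are placeholders you would define yourself, e.g. with a -D build option): pad the NDRange along dimension 0 with extra work-groups and have them return straight away. Because the condition is uniform within a work-group, every work-item of a "real" group still reaches the barrier, so the kernel stays valid.

__kernel void f1_ap2xy_padded(__global float *in_amp, __global float *in_phase,
                              __global float *in_lut, __global float *out)
{
    /* Work-groups beyond the real work exit immediately. The branch is
     * uniform per work-group, so barriers in the remaining body are still
     * reached by every work-item of the groups that do the real work. */
    if (get_group_id(0) >= NUM_REAL_GROUPS)
        return;

    /* ... original kernel body, including barrier(CLK_LOCAL_MEM_FENCE) ... */
}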
Also if you want the best performance, make sure you disable DVFS:
echo "off" > /sys/class/misc/mali0/device/dvfs
And set the GPU at the maximum frequency:
echo "533" > /sys/class/misc/mali0/device/clock