I am developing a project based on Mali T764 (RK3288). I have problems in the kernel queued time and the function clGetEventProfilingInfo() seems returning wrong value. My codes are below:
cl_ulong time_begin[20], time_finish[20],time_total[20]; //...initializing cmdQueue=clCreateCommandQueue(context,devices[0],CL_QUEUE_PROFILING_ENABLE,&err); … err=clEnqueueNDRangeKernel(cmdQueue,k1,2,NULL,globalWorkSize,localWorkSize,0,NULL,&Evt1[0]);
err=clEnqueueNDRangeKernel(cmdQueue,k2,2,NULL,globalWorkSize,localWorkSize,0,NULL,&Evt2[0]); err=clEnqueueNDRangeKernel(cmdQueue,ki2,2,NULL,globalWorkSize,localWorkSize,0,NULL,&iEvt2[0]);
err=clEnqueueNDRangeKernel(cmdQueue,ki1,2,NULL,globalWorkSize,localWorkSize,0,&iEvt1[0],&iEvt1[0]);
err=clEnqueueNDRangeKernel(cmdQueue,k1,2,NULL,globalWorkSize,localWorkSize,0,NULL,&Evt1[1]);
err=clEnqueueNDRangeKernel(cmdQueue,k2,2,NULL,globalWorkSize,localWorkSize,0,NULL,&Evt2[1]);
…
err=clEnqueueNDRangeKernel(cmdQueue,ki1,2,NULL,globalWorkSize,localWorkSize,0,NULL, &iEvt1[5]);
clFinish(cmdQueue);
for(int xx=0;xx<6;xx++)
{
timeCount(Evt1[xx],&time_begin[xx*4],&time_finish[xx*4]);
timeCount(Evt2[xx],&time_begin[xx*4+1],&time_finish[xx*4+1]);
timeCount(iEvt2[xx],&time_begin[xx*4+2],&time_finish[xx*4+2]);
timeCount(iEvt1[xx],&time_begin[xx*4+3],&time_finish[xx*4+3]);
time_total[xx*4]=(time_finish[xx*4]-time_begin[xx*4]);
time_total[xx*4+1]=(time_finish[xx*4+1]-time_begin[xx*4+1]);
time_total[xx*4+2]=(time_finish[xx*4+2]-time_begin[xx*4+2]);
time_total[xx*4+3]=(time_finish[xx*4+3]-time_begin[xx*4+3]);
}
void timeCount(cl_event event,cl_ulong *time_begin,cl_ulong *time_finish) { *time_begin=0; *time_finish=0; err=clGetEventProfilingInfo(event,CL_PROFILING_COMMAND_QUEUED,sizeof(cl_ulong ),time_begin,NULL); //err=clGetEventProfilingInfo(event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong) ,time_begin,NULL); err=clGetEventProfilingInfo(event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),t ime_finish,NULL); }
These k1,k2,ik1,ik2 kernels do similar calculations and the execution time of them are supposed to be about 1ms. After running this program, I get weird result.
time_total[0]=1838000 ns
time_total[1]=2096000 ns
time_total[2]=75492000 ns
time_total[3]=26357000 ns
time_total[2]=199767000 ns
time_total[3]=228629000 ns
The first two elements may be correct, but the following results are unreasonable. There is no other program running on the system, and a similar kernel shouldn’t be queued so long time before it executed. I also tried other method to measure the whole execution time. It takes about 400ms for all kernels executed, which is 10 times more than it supposed.
My questions are
I will be very grateful if anyone could help me figure this out.
Dear tan414,
First of all, thanks for the detailed query. We understand that it is essential to get accurate profiling information. Regarding to the clGetEventProfilingInfo() returns problematic values, this seem to be an issue related to the platform (RK3288). May I please redirect you to Rockchip. Please mention the problem to them, and they will have to provide you with a new BSP.
Thanks,
Neil
Dear neiltan,
Thanks for your reply. I will ask Rockchip to see if they can figure it out.
Best!
Tan