I am developing a project based on Mali T764 (RK3288). I have problems in the kernel queued time and the function clGetEventProfilingInfo() seems returning wrong value. My codes are below:
cl_ulong time_begin[20], time_finish[20],time_total[20]; //...initializing cmdQueue=clCreateCommandQueue(context,devices[0],CL_QUEUE_PROFILING_ENABLE,&err); … err=clEnqueueNDRangeKernel(cmdQueue,k1,2,NULL,globalWorkSize,localWorkSize,0,NULL,&Evt1[0]);
err=clEnqueueNDRangeKernel(cmdQueue,k2,2,NULL,globalWorkSize,localWorkSize,0,NULL,&Evt2[0]); err=clEnqueueNDRangeKernel(cmdQueue,ki2,2,NULL,globalWorkSize,localWorkSize,0,NULL,&iEvt2[0]);
err=clEnqueueNDRangeKernel(cmdQueue,ki1,2,NULL,globalWorkSize,localWorkSize,0,&iEvt1[0],&iEvt1[0]);
err=clEnqueueNDRangeKernel(cmdQueue,k1,2,NULL,globalWorkSize,localWorkSize,0,NULL,&Evt1[1]);
err=clEnqueueNDRangeKernel(cmdQueue,k2,2,NULL,globalWorkSize,localWorkSize,0,NULL,&Evt2[1]);
…
err=clEnqueueNDRangeKernel(cmdQueue,ki1,2,NULL,globalWorkSize,localWorkSize,0,NULL, &iEvt1[5]);
clFinish(cmdQueue);
for(int xx=0;xx<6;xx++)
{
timeCount(Evt1[xx],&time_begin[xx*4],&time_finish[xx*4]);
timeCount(Evt2[xx],&time_begin[xx*4+1],&time_finish[xx*4+1]);
timeCount(iEvt2[xx],&time_begin[xx*4+2],&time_finish[xx*4+2]);
timeCount(iEvt1[xx],&time_begin[xx*4+3],&time_finish[xx*4+3]);
time_total[xx*4]=(time_finish[xx*4]-time_begin[xx*4]);
time_total[xx*4+1]=(time_finish[xx*4+1]-time_begin[xx*4+1]);
time_total[xx*4+2]=(time_finish[xx*4+2]-time_begin[xx*4+2]);
time_total[xx*4+3]=(time_finish[xx*4+3]-time_begin[xx*4+3]);
}
void timeCount(cl_event event,cl_ulong *time_begin,cl_ulong *time_finish) { *time_begin=0; *time_finish=0; err=clGetEventProfilingInfo(event,CL_PROFILING_COMMAND_QUEUED,sizeof(cl_ulong ),time_begin,NULL); //err=clGetEventProfilingInfo(event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong) ,time_begin,NULL); err=clGetEventProfilingInfo(event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),t ime_finish,NULL); }
These k1,k2,ik1,ik2 kernels do similar calculations and the execution time of them are supposed to be about 1ms. After running this program, I get weird result.
time_total[0]=1838000 ns
time_total[1]=2096000 ns
time_total[2]=75492000 ns
time_total[3]=26357000 ns
time_total[2]=199767000 ns
time_total[3]=228629000 ns
The first two elements may be correct, but the following results are unreasonable. There is no other program running on the system, and a similar kernel shouldn’t be queued so long time before it executed. I also tried other method to measure the whole execution time. It takes about 400ms for all kernels executed, which is 10 times more than it supposed.
My questions are
I will be very grateful if anyone could help me figure this out.
Dear neiltan,
Thanks for your reply. I will ask Rockchip to see if they can figure it out.
Best!
Tan