Hi,
I am observing that an empty GPU kernel is taking a huge amount of time. I am using a "Samsung Exynos Octa 5420" board, which has a Mali GPU. I have one kernel with a work group size of around "3000". When I run it with logic inside the kernel and pass arguments, it takes 2 msec; but when I run the same kernel without any logic in it and without passing any arguments, it takes 19 msec. I have heard that a kernel with a lighter load should run faster, so why is a kernel with no load at all taking so much time? For this query, just consider my kernel's logic to be a simple factorial over NxN elements. I hope I have given complete information about my problem; please let me know if you need any more information to solve it.
Thanks & Regards,
Narendra Kumar Chepuri.
Hi Narendra,
Can I confirm, when you say "work group size around 3000", that you actually mean a global work size of 3000, rather than the workgroup size?
It may be worth putting together a small reproducer so we can test on our end and figure out what is happening.
If there are no arguments, then there are no cache maintenance operations, so that wouldn't be the reason for the longer execution time. It could potentially be due to the local workgroup size, but that doesn't fully explain such a large difference in execution time.
Can I also ask how you are timing this? There may be an issue there that is masking the true execution times.
Thanks,
Michael McGeagh
Hi Michael,
Here, work group size means global work size. One more thing: I tried with and without arguments with an empty kernel body, but I still saw the same execution time. I have profiled using the gettimeofday function in C around the clEnqueueNDRangeKernel() call.
Thanks & Regards,
Narendra Kumar.
Just FYI, the clEnqueueNDRangeKernel entry point does not block until the end of kernel execution; it is an asynchronous call which enqueues the kernel for execution and then returns. Measuring the runtime of this API call will not tell you the execution time of the kernel on the GPU, just how long it took for the entry point to return.
Hth,
Chris
Hi Chris.
I agree with your statement, but I am measuring the start time before clEnqueueNDRangeKernel and the end time after clFlush and clFinish; clFinish is a blocking call, which confirms that the kernel has executed completely.
I would agree with Michael's suggestion here of getting us a small reproducer. Can you put a minimum example together that exhibits this issue? It would likely be the most efficient way we can help you.
Regards,
Tim
Hi Hartley,
I will explain it block-wise with an example.
block 1:
{
Input preparation, which fills the GPU input source buffer;
}
block 2:
{
StartTime of profile;
Kernel execution: clSetKernelArg calls followed by clEnqueueNDRangeKernel, clFlush, clFinish;
EndTime of profile;
}
block 3:
{
Print the execution time of the kernel, plus some CPU task which works on the GPU output;
}
I think this information is sufficient for checking; if you still want more details on the algorithm, I can share the exact code.
My apologies for not getting back to you sooner - I had missed the above post.
I'm not sure there's enough here to easily track this down. Can you provide an actual simple reproducer for us to investigate further? It doesn't necessarily need to be the algorithm you are using - whatever minimal piece of code exhibits the problem will do.