I am using OpenCL 2.0 on a Mali-G72 based Android device, and I am seeing a very large kernel queue/submit overhead (CL_PROFILING_COMMAND_START - CL_PROFILING_COMMAND_QUEUED). It is sometimes 10x higher than the kernel execution time (CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_START).
On other GPU devices (with the same specs, or even a lower clock speed), I do not get this large overhead with the same OpenCL code.
I have around 12 kernels running in an in-order queue. The only memory map operations are one before the first kernel call and one after the last kernel call, to read back the output. All of this runs in a loop. From one iteration to the next, even the kernels in the middle of the queue, which have no memory operation before or after them, still show this large queue/submit overhead on this GPU, while the other GPU device does not have this issue.
Here is an example of my profiling info:
--------------
Kernel 1 - execution time = 0.18ms, queued time = 0.07ms, submit time = 2.91ms
Kernel 2 - execution time = 0.04ms, queued time = 0.07ms, submit time = 0.12ms
Kernel 3 - execution time = 0.01ms, queued time = 0.09ms, submit time = 0.15ms
Kernel 4 - execution time = 0.01ms, queued time = 0.08ms, submit time = 0.12ms
Kernel 5 - execution time = 0.18ms, queued time = 0.08ms, submit time = 0.12ms
Kernel 6 - execution time = 0.02ms, queued time = 0.08ms, submit time = 0.18ms
Kernel 7 - execution time = 0.01ms, queued time = 0.08ms, submit time = 0.12ms
Kernel 8 - execution time = 0.02ms, queued time = 0.07ms, submit time = 0.10ms
Kernel 9 - execution time = 0.01ms, queued time = 0.83ms, submit time = 0.72ms
Kernel 10 - execution time = 0.57ms, queued time = 0.15ms, submit time = 0.11ms
Kernel 11 - execution time = 0.07ms, queued time = 0.07ms, submit time = 0.13ms
Kernel 12 - execution time = 0.01ms, queued time = 0.07ms, submit time = 0.13ms
--------------
ITERATION overallWorkTime = 1.12ms, overallSetupTime = 6.66ms, overallTotalWorkTime = 7.79ms
--------------
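For reference, here is a minimal sketch of how per-kernel figures like these can be derived from the four event timestamps. It assumes that "queued time" means SUBMIT - QUEUED and "submit time" means START - SUBMIT, which is an interpretation rather than something stated in the log. The struct and function names are hypothetical. The timestamps are taken as plain nanosecond values, as clGetEventProfilingInfo would return them for CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four profiling timestamps of one kernel
 * event, in nanoseconds (the unit used by clGetEventProfilingInfo). */
typedef struct {
    uint64_t queued_ns;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit_ns;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start_ns;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end_ns;     /* CL_PROFILING_COMMAND_END    */
} kernel_timestamps;

/* Derived intervals in milliseconds. */
typedef struct {
    double queued_ms;    /* SUBMIT - QUEUED (assumed meaning of "queued time") */
    double submit_ms;    /* START - SUBMIT  (assumed meaning of "submit time") */
    double exec_ms;      /* END - START */
    double overhead_ms;  /* START - QUEUED, the overhead discussed above */
} kernel_breakdown;

static double ns_to_ms(uint64_t ns) { return (double)ns / 1.0e6; }

kernel_breakdown breakdown_from_timestamps(kernel_timestamps t) {
    /* The timestamps are expected to be non-decreasing; checking this
     * guards the unsigned subtractions below against wrap-around. */
    assert(t.queued_ns <= t.submit_ns);
    assert(t.submit_ns <= t.start_ns);
    assert(t.start_ns <= t.end_ns);

    kernel_breakdown b;
    b.queued_ms   = ns_to_ms(t.submit_ns - t.queued_ns);
    b.submit_ms   = ns_to_ms(t.start_ns - t.submit_ns);
    b.exec_ms     = ns_to_ms(t.end_ns - t.start_ns);
    b.overhead_ms = ns_to_ms(t.start_ns - t.queued_ns);
    return b;
}
```

Summing queued_ms and submit_ms over all kernels in an iteration would give a figure comparable to overallSetupTime, and summing exec_ms one comparable to overallWorkTime.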
I would really appreciate it if anybody could point me in the right direction. I have been trying to reduce these timings with different methods, but I still cannot get rid of this overhead. Thanks.
How often are you flushing the command queue? What driver version are you using?
The submission overhead can depend on platform-specific factors such as power management.
You can also try using the cl_khr_priority_hints extension (https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/cl_khr_priority_hints.html) to raise the priority of the command queue. This will reduce the submission overhead on most platforms.
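Since cl_khr_priority_hints is an optional extension, it is worth checking that the device reports it before requesting a priority. The sketch below shows one way to do that check on the space-separated extension string that clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS. The helper name has_extension is hypothetical, and the queue creation in the trailing comment is only an outline that needs an OpenCL SDK to build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `name` appears as a whole token in `ext_list`, a
 * space-separated extension string. Matching whole tokens avoids a false
 * positive when one extension name is a prefix of another. */
bool has_extension(const char *ext_list, const char *name) {
    assert(ext_list != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0) {
        return false;
    }

    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        bool starts_token = (p == ext_list) || (p[-1] == ' ');
        bool ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token) {
            return true;
        }
        p += 1; /* keep searching after this partial match */
    }
    return false;
}

/* Usage outline (requires an OpenCL SDK, not compiled here):
 *
 *   clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, buf, NULL);
 *   if (has_extension(buf, "cl_khr_priority_hints")) {
 *       cl_queue_properties props[] = {
 *           CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE,
 *           CL_QUEUE_PRIORITY_KHR, CL_QUEUE_PRIORITY_HIGH_KHR,
 *           0 };
 *       queue = clCreateCommandQueueWithProperties(context, device,
 *                                                  props, &err);
 *   }
 */
```

Whether a high-priority queue actually shortens the submit interval on a given driver is something to confirm by re-running the same profiling loop.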