OpenCL kernel overhead (queue time) on Mali-G72


Hi,

I am using OpenCL 2.0 on a Mali-G72 based Android device and I am seeing a very large kernel queue/submit overhead (CL_PROFILING_COMMAND_START - CL_PROFILING_COMMAND_QUEUED). It is sometimes more than 10x the kernel execution time (CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_START).
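For reference, here is a minimal sketch of how these intervals can be derived from the four event timestamps. It assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that each timestamp was read with clGetEventProfilingInfo. The struct and function names are hypothetical, and plain uint64_t nanosecond values stand in for cl_ulong so the snippet compiles without the OpenCL headers. Mapping "queued time" to SUBMIT - QUEUED and "submit time" to START - SUBMIT is an assumption about the labels used in the listing below.

```c
#include <stdint.h>

/* Per-kernel breakdown in milliseconds (hypothetical helper type). */
typedef struct {
    double queued_ms;   /* CL_PROFILING_COMMAND_SUBMIT - CL_PROFILING_COMMAND_QUEUED */
    double submit_ms;   /* CL_PROFILING_COMMAND_START  - CL_PROFILING_COMMAND_SUBMIT */
    double exec_ms;     /* CL_PROFILING_COMMAND_END    - CL_PROFILING_COMMAND_START  */
    double overhead_ms; /* CL_PROFILING_COMMAND_START  - CL_PROFILING_COMMAND_QUEUED */
} kernel_timing;

/* Profiling timestamps are reported in nanoseconds; convert a delta to ms. */
static double delta_ms(uint64_t later_ns, uint64_t earlier_ns)
{
    return (double)(later_ns - earlier_ns) / 1e6;
}

/* Split one event's lifetime into queued, submit, and execution intervals.
 * The four arguments are the values clGetEventProfilingInfo would return
 * for QUEUED, SUBMIT, START, and END respectively. */
kernel_timing split_timing(uint64_t queued_ns, uint64_t submit_ns,
                           uint64_t start_ns, uint64_t end_ns)
{
    kernel_timing t;
    t.queued_ms   = delta_ms(submit_ns, queued_ns);
    t.submit_ms   = delta_ms(start_ns, submit_ns);
    t.exec_ms     = delta_ms(end_ns, start_ns);
    t.overhead_ms = delta_ms(start_ns, queued_ns);
    return t;
}
```

With this split, the overhead in question is overhead_ms, which is the sum of queued_ms and submit_ms.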

On other GPU devices (with the same specs, or even a lower clock speed) I do not get this large overhead with the same OpenCL code.

I have around 12 kernels running in an in-order queue. The only memory-map operations are one before the first kernel call and one after the last kernel call to read the output. This runs in a loop. Yet from one iteration to the next, even the kernels in the middle of the queue, which have no memory operation before or after them, still show this large queue/submit overhead on this GPU, while the other GPU device does not have this issue.

Here is an example of my profiling info:


--------------
Kernel 1 - execution time = 0.18ms
Kernel 1 - queued time = 0.07ms
Kernel 1 - submit time = 2.91ms


Kernel 2 - execution time = 0.04ms
Kernel 2 - queued time = 0.07ms
Kernel 2 - submit time = 0.12ms

Kernel 3 - execution time = 0.01ms
Kernel 3 - queued time = 0.09ms
Kernel 3 - submit time = 0.15ms

Kernel 4 - execution time = 0.01ms
Kernel 4 - queued time = 0.08ms
Kernel 4 - submit time = 0.12ms

Kernel 5 - execution time = 0.18ms
Kernel 5 - queued time = 0.08ms
Kernel 5 - submit time = 0.12ms

Kernel 6 - execution time = 0.02ms
Kernel 6 - queued time = 0.08ms
Kernel 6 - submit time = 0.18ms

Kernel 7 - execution time = 0.01ms
Kernel 7 - queued time = 0.08ms
Kernel 7 - submit time = 0.12ms


Kernel 8 - execution time = 0.02ms
Kernel 8 - queued time = 0.07ms
Kernel 8 - submit time = 0.10ms


Kernel 9 - execution time = 0.01ms
Kernel 9 - queued time = 0.83ms
Kernel 9 - submit time = 0.72ms


Kernel 10 - execution time = 0.57ms
Kernel 10 - queued time = 0.15ms
Kernel 10 - submit time = 0.11ms


Kernel 11 - execution time = 0.07ms
Kernel 11 - queued time = 0.07ms
Kernel 11 - submit time = 0.13ms


Kernel 12 - execution time = 0.01ms
Kernel 12 - queued time = 0.07ms
Kernel 12 - submit time = 0.13ms

-------
ITERATION overallWorkTime = 1.12ms
ITERATION overallSetupTime = 6.66ms
ITERATION overallTotalWorkTime = 7.79ms
--------------
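As a sanity check on the totals, the per-kernel values above can be summed directly. This is a small sketch with hypothetical names (sum_ms, overhead_ratio). It assumes that overallSetupTime is the sum of the queued and submit times and that overallWorkTime is the sum of the execution times.

```c
#include <stddef.h>

/* Per-kernel values in ms, copied from the listing above (kernels 1 to 12). */
static const double exec_ms[12] = {
    0.18, 0.04, 0.01, 0.01, 0.18, 0.02, 0.01, 0.02, 0.01, 0.57, 0.07, 0.01
};
static const double queued_ms[12] = {
    0.07, 0.07, 0.09, 0.08, 0.08, 0.08, 0.08, 0.07, 0.83, 0.15, 0.07, 0.07
};
static const double submit_ms[12] = {
    2.91, 0.12, 0.15, 0.12, 0.12, 0.18, 0.12, 0.10, 0.72, 0.11, 0.13, 0.13
};

/* Sum n millisecond values. */
double sum_ms(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

/* Ratio of total queue/submit overhead to total execution time. */
double overhead_ratio(void)
{
    double work  = sum_ms(exec_ms, 12);
    double setup = sum_ms(queued_ms, 12) + sum_ms(submit_ms, 12);
    return setup / work;
}
```

The sums come to about 1.13 ms of execution and 6.65 ms of queue/submit time, which matches the reported totals to within rounding. Over the whole iteration the overhead is therefore roughly 5.9x the execution time.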

I would really appreciate it if anybody could point me in the right direction. I have tried several methods to reduce these timings, but I still cannot get rid of this overhead. Thanks.
