Mali OpenCL: clEnqueueNDRangeKernel and clEnqueueTask have high API overhead

Dear All,

One of my use cases for the ARM Mali GPU is running video (HEVC) decode kernels. However, we have found that the overhead of the OpenCL kernel launch APIs, clEnqueueNDRangeKernel and clEnqueueTask, is much higher than the execution time of the kernels themselves. This reduces the overall video decoding speed considerably.

Is there anything we can do to reduce this overhead? Any tips? If you need more details about the issue, I can provide them.

Regards

Paul
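
One way to put numbers on the gap described above is to compare the timestamps OpenCL records for a profiled kernel event. The sketch below is only an illustration: it assumes the four values were already read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, CL_PROFILING_COMMAND_END) on a queue created with CL_QUEUE_PROFILING_ENABLE. The struct and helper names are hypothetical. These timestamps show how long a command waits between being queued and starting on the device. The CPU time spent inside the enqueue call itself would have to be measured separately with a host timer around the call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four event timestamps, in nanoseconds.
 * In a real program each field would be filled by clGetEventProfilingInfo. */
typedef struct {
    uint64_t queued;  /* command placed in the host-side queue */
    uint64_t submit;  /* command handed to the device */
    uint64_t start;   /* kernel begins executing */
    uint64_t end;     /* kernel finishes executing */
} kernel_timestamps;

/* Time from enqueue until the kernel actually starts on the device. */
uint64_t launch_latency_ns(const kernel_timestamps *t)
{
    assert(t->start >= t->queued);
    return t->start - t->queued;
}

/* Time the kernel spends executing on the device. */
uint64_t execution_ns(const kernel_timestamps *t)
{
    assert(t->end >= t->start);
    return t->end - t->start;
}

/* Launch latency divided by execution time; values above 1.0 mean the
 * kernel waits longer than it runs. Returns 0.0 if execution time is 0. */
double latency_to_execution_ratio(const kernel_timestamps *t)
{
    uint64_t exec = execution_ns(t);
    return exec ? (double)launch_latency_ns(t) / (double)exec : 0.0;
}
```

With made-up timestamps of 1000, 151000, 401000 and 451000 ns, the launch latency is 400000 ns against 50000 ns of execution, a ratio of 8. A ratio like that would match the symptom reported here, although the numbers are purely illustrative.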

  • Hi Anthony,

    Thanks for your reply.

    After profiling the overheads, we realized that our smaller kernels finish too quickly to hide the enqueue time.

    The JPEG processing example is good. But for real-time video processing, where frame-rate (FPS) throughput is of prime importance, asynchronous enqueuing of smaller kernels will not help much. Merging smaller kernels into a bigger one and dispatching more threads, so that all the compute units of the GPU are engaged, is probably a better option. However, larger kernels are likely to be slower because of extensive branch and memory divergence.

    In general, one desirable improvement in OpenCL GPU drivers would be to minimize launch overheads at run time by moving that work to initialization. Alternatively, drivers could give the programmer some low-level control over how work is enqueued and submitted to the GPU.

    Regards,

    Paul
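
The trade-off in the reply above (fewer launches versus a slower merged kernel) can be put in rough numbers. This is a minimal sketch under simplifying assumptions: every launch pays the same fixed overhead, launches do not overlap with execution, and the cost of branch and memory divergence in the merged kernel is modelled as a single slowdown factor. All function names and values are hypothetical and are not measurements from a Mali device.

```c
#include <assert.h>

/* Frame time in microseconds when each of `kernels` small kernels is
 * enqueued separately and every launch pays `overhead_us` before running. */
double frame_time_separate_us(int kernels, double overhead_us, double exec_us)
{
    assert(kernels > 0);
    return kernels * (overhead_us + exec_us);
}

/* Frame time when the same work is merged into one launch.
 * `slowdown` (>= 1.0) models the extra cost of divergence in the
 * larger kernel. */
double frame_time_merged_us(int kernels, double overhead_us,
                            double exec_us, double slowdown)
{
    assert(kernels > 0 && slowdown >= 1.0);
    return overhead_us + kernels * exec_us * slowdown;
}

/* Slowdown factor at which merging stops paying off, obtained by
 * setting the two frame times equal and solving for `slowdown`. */
double break_even_slowdown(int kernels, double overhead_us, double exec_us)
{
    assert(kernels > 0 && exec_us > 0.0);
    return 1.0 + ((kernels - 1) * overhead_us) / (kernels * exec_us);
}
```

For example, with 10 kernels, 100 us of overhead per launch and 20 us of execution each, the separate launches cost 1200 us per frame. A merged kernel that runs 1.5 times slower costs 400 us, and under this model merging keeps winning until the slowdown reaches 5.5. Real drivers can overlap submission with execution, so the actual break-even point needs to be measured.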
