Mali OpenCL: clEnqueueNDRangeKernel and clEnqueueTask have high API overhead

Dear All,

One of my use cases for ARM Mali graphics is running video (HEVC) decode kernels. However, we found that the overhead of the OpenCL kernel-launch APIs clEnqueueNDRangeKernel and clEnqueueTask is much higher than the execution time of the kernels themselves. This reduces the overall video decoding speed considerably.

Is there anything we can do to reduce this overhead? Any tips? If you need more details about the issue, I can explain.

Regards

Paul

  • Hi Paul,

    If the enqueue time cannot be hidden by the execution time, it's likely that your kernels are too small (which is even more likely if you're using clEnqueueTask, as this essentially runs a kernel with only one thread).

    So maybe try to regroup your kernels and dispatch more threads.
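
To make the "kernels are too small" point concrete, here is a minimal cost-model sketch in plain C. It does not call OpenCL: the `launch_model` struct, both function names, and every number are hypothetical, and the model deliberately ignores GPU parallelism and host/GPU overlap. It only shows how a fixed per-enqueue cost is amortized when the same work is dispatched in fewer, larger launches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cost model, not Mali measurements: each enqueue call has a
 * fixed host-side cost and each work-item a fixed execution cost. */
typedef struct {
    uint64_t enqueue_overhead_us; /* assumed cost of one clEnqueue* call */
    uint64_t work_item_cost_us;   /* assumed cost of one work-item */
} launch_model;

/* Total time to process `items` work-items split evenly across `launches`
 * enqueue calls, assuming no overlap between host and GPU. */
uint64_t total_time_us(const launch_model *m, uint64_t items, uint64_t launches)
{
    assert(m != NULL && launches > 0 && items % launches == 0);
    return launches * m->enqueue_overhead_us + items * m->work_item_cost_us;
}

/* Share of the total time spent on enqueue overhead, in percent,
 * rounded down. */
uint64_t overhead_percent(const launch_model *m, uint64_t items, uint64_t launches)
{
    uint64_t total = total_time_us(m, items, launches);
    assert(total > 0);
    return (100 * launches * m->enqueue_overhead_us) / total;
}
```

With an assumed 100 us per enqueue and 1 us per work-item, 1024 single-item launches (the clEnqueueTask pattern) spend almost all of their time in overhead, while one launch of 1024 work-items spends only a small fraction there.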

    In case you're not doing it already you could have a look at asynchronously enqueuing your kernels as explained here: ARM Guide to OpenCL Implementing JPEG: Application Design for the Application Processor and GPU

    Hope this helps,

    Anthony

  • Hi Anthony,

    I ran into a problem similar to Paul's. I found that clCreateKernel was very slow for some kernels, taking from 15 ms to 50 ms. I used the mali_clcc -v option to dump the instruction counts. The number of instruction words emitted ranged from 12 to 136. Furthermore, the number of instructions did not seem to have any relation to the kernel creation time.

    So I am wondering what factors affect the clCreateKernel execution time: the number of instructions? Creating the program from binary or from source? The code length?

    I would very much appreciate any clues. This is very important for us in production (although clCreateKernel is usually called only once per application, it sometimes still affects the whole pipeline).

    Thanks in advance.

  • Hi,

    clCreateKernel has nothing to do with clEnqueueNDRangeKernel.

    In your case, if you want to speed up kernel creation, you need to use a binary program.

    To generate a binary program:

    - Build a program from sources.

    - Build all the kernels in the program (if you don't, the binary you save will be an IR rather than an actual binary).

    - Retrieve the program binary using clGetProgramInfo and save it to a file.

    This should be much quicker than building from sources; if it's not, it's likely that your driver is too old.
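
The "save it to a file" step can be sketched with two plain C helpers. This is a minimal illustration, not Anthony's code: `save_binary` and `load_binary` are hypothetical names, and the OpenCL calls are left out so the sketch compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write `size` bytes of a program binary to `path`.
 * Returns 0 on success, -1 on failure. */
int save_binary(const char *path, const unsigned char *data, size_t size)
{
    assert(path != NULL && data != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(data, 1, size, f);
    int close_failed = fclose(f);
    return (written == size && close_failed == 0) ? 0 : -1;
}

/* Read the whole file at `path` into a malloc'd buffer and store its length
 * in *size_out. Returns NULL on failure; the caller frees the buffer. */
unsigned char *load_binary(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0 || fseek(f, 0, SEEK_SET) != 0) { fclose(f); return NULL; }
    unsigned char *buf = malloc(len > 0 ? (size_t)len : 1);
    if (buf == NULL) { fclose(f); return NULL; }
    size_t got = fread(buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) { free(buf); return NULL; }
    *size_out = (size_t)len;
    return buf;
}
```

In a real application the bytes passed to `save_binary` would come from clGetProgramInfo, queried first with CL_PROGRAM_BINARY_SIZES and then with CL_PROGRAM_BINARIES. On later runs, the buffer returned by `load_binary` would go to clCreateProgramWithBinary, followed by clBuildProgram. A saved binary is generally tied to the device and driver it was built with, so keeping a fallback path that rebuilds from source when loading fails is a reasonable precaution.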

    Hope this helps,

    Anthony

  • Hi, Anthony

    Using the prebuilt binary file works. Time consumption dropped from 10+ ms to 100+ µs.

    Thumbs up

    Irving

  • Hi Anthony,

    Thanks for your reply.

    After profiling the overheads, we realized that our smaller kernels finish too quickly to hide the enqueue time.

    The JPEG processing example is good. But for real-time video processing, where FPS throughput is of prime importance, asynchronous enqueuing of smaller kernels will not help much. Merging smaller kernels into a bigger one and dispatching more threads that engage all the compute units of the GPU is probably a better option. However, larger kernels are likely to be slower because of extensive branch and memory divergence.
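
The argument about asynchronous enqueuing can be checked with a small timeline model in plain C. This is a sketch under stated assumptions, not a measurement: the function names and all timings are hypothetical, the host is assumed to issue enqueues back to back, and the GPU is assumed to run one kernel at a time. Divergence costs in merged kernels are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Synchronous pattern: the host enqueues a kernel and waits for it to
 * finish before enqueuing the next one. */
uint64_t sync_total_us(uint64_t enqueue_us, uint64_t exec_us, uint64_t n)
{
    return n * (enqueue_us + exec_us);
}

/* Asynchronous pattern: the host keeps enqueuing while the GPU executes.
 * A kernel starts once its enqueue has completed and the previous kernel
 * has finished, whichever is later. */
uint64_t async_total_us(uint64_t enqueue_us, uint64_t exec_us, uint64_t n)
{
    uint64_t host_done = 0; /* time the latest enqueue call returned */
    uint64_t gpu_done = 0;  /* time the latest kernel finished */
    for (uint64_t i = 0; i < n; ++i) {
        host_done += enqueue_us;
        uint64_t start = host_done > gpu_done ? host_done : gpu_done;
        gpu_done = start + exec_us;
    }
    return gpu_done;
}
```

With an assumed 50 us enqueue and 10 us kernel, 100 launches take 6000 us synchronously and still 5010 us asynchronously, because the GPU sits idle waiting for the next enqueue. A single merged launch doing the same 1000 us of work finishes in 1050 us under this model, provided merging does not itself slow the kernel down.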

    In general, one desirable improvement in OpenCL GPU drivers would be to minimize launch overheads at run time by moving them into initialization. Alternatively, drivers could give the programmer some low-level control over enqueue and submit activity on the GPU.

    Regards,

    Paul