Hi All,
Do we have any idea when ARM will release an SDK for later versions of OpenCL (2.x)?
Currently we see only the OpenCL 1.1 SDK available.
Regards
Paul
Hi Anthony,
Thanks for your response.
While working on an OpenCL acceleration project for a video decoder on the Firefly RK3288 development board (Cortex-A17 / Mali-T764) running Android, we found that OpenCL's kernel launch API, clEnqueueNDRangeKernel, appears to have a huge overhead. Our test kernel performs an indexed integer addition of two buffers (512 bytes) and stores the results in a third buffer. We launched the same kernel twice, to see whether the second call had any advantage over the first.
OpenCL ARM profiling:
                 Queued Time (ns)   Submit Time (ns)   Execution Start Time (ns)   Execution End Time (ns)
First Launch:    30,12,55,724       3,65,38,31,696     30,12,55,724                3,65,39,77,696
Second Launch:   30,12,55,724       3,65,59,68,696     30,12,55,724                3,65,60,81,696
The board uses the ARM OpenCL 1.1 drivers. What seems surprising or erratic in the figures above is that 'Queued Time' and 'Execution Start Time' returned the same value.
There is also a big gap between 'Queued Time' and 'Submit Time' (a huge overhead).
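To make the gaps concrete, the intervals in the ARM table can be worked out with a small helper. This is only a sketch: it assumes the values are nanosecond timestamps as labeled in the table, and the names parse_grouped_ns, launch_times, parse_launch, is_ordered and submit_to_end_ns are made up for illustration (they are not part of any OpenCL API).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parse a digit-grouped value such as "3,65,38,31,696" into an integer
 * by keeping the digits and skipping everything else (the commas).
 * Assumption: the value is a timestamp in nanoseconds. */
static uint64_t parse_grouped_ns(const char *s)
{
    assert(s != NULL);
    uint64_t value = 0;
    for (; *s != '\0'; ++s) {
        if (*s >= '0' && *s <= '9')
            value = value * 10u + (uint64_t)(*s - '0');
    }
    return value;
}

/* The four timestamps of one launch, in the column order of the table. */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} launch_times;

/* Build one table row from its four digit-grouped strings. */
static launch_times parse_launch(const char *queued, const char *submit,
                                 const char *start, const char *end)
{
    launch_times t;
    t.queued = parse_grouped_ns(queued);
    t.submit = parse_grouped_ns(submit);
    t.start  = parse_grouped_ns(start);
    t.end    = parse_grouped_ns(end);
    return t;
}

/* Returns 1 when queued <= submit <= start <= end, otherwise 0. */
static int is_ordered(const launch_times *t)
{
    return t->queued <= t->submit &&
           t->submit <= t->start &&
           t->start  <= t->end;
}

/* Submit-to-end span in ns; only meaningful when end >= submit. */
static uint64_t submit_to_end_ns(const launch_times *t)
{
    return t->end - t->submit;
}
```

For the first ARM launch this gives a submit-to-end span of 146,000 ns and a queued-to-submit gap of 3,352,575,972 ns (about 3.35 s); for the second launch the submit-to-end span is 113,000 ns. is_ordered returns 0 for both rows, because the reported start timestamp is earlier than the submit timestamp.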
The same kernel, launched by the same OpenCL host code on an Intel Xeon desktop, gave the following results:
OpenCL Intel Xeon profiling:
                 Queued Time (ns)   Submit Time (ns)   Execution Start Time (ns)   Execution End Time (ns)
First Launch:    3,80,34,22,080     3,80,34,38,720     3,80,36,53,504              3,80,36,66,432
Second Launch:   3,80,46,77,760     3,80,47,02,336     3,80,47,23,936              3,80,47,27,936
In this case, all the timings seem to be normal.
Our questions:
1. What are we missing? Are the OpenCL drivers on the board too old, or are such values normal for Mali OpenCL in general?
2. OpenCL 2.x has a new feature, device-side kernel enqueue (nested parallelism). Could this be a workaround to avoid the host kernel launch overhead?
3. Any other suggestions?
Can you please help?
Hi Paul,
I'm sorry, I don't understand what units the numbers in your tables are in. (Why are the digits packed in groups of two?)
Also, I assume the times in each column come from clGetEventProfilingInfo and therefore correspond to CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
If that's correct, then command_start and command_end are the timestamps for when the job was executed on the GPU, command_queued is when clEnqueue was called, and command_submit is when the queue was flushed by the application.
None of those allow you to draw any conclusion about the time it takes to enqueue a kernel.
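One way to look at the cost of the enqueue call itself would be to time it on the host with a wall clock rather than with event timestamps. The following is a minimal sketch under that assumption, not a measurement of any particular driver: host_now_ns, timed_call_fn, time_call_ns and busy_work are hypothetical names, and busy_work is only a stand-in for a callback that would wrap clEnqueueNDRangeKernel (optionally followed by clFlush) in real host code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic host clock in nanoseconds (POSIX clock_gettime). */
static uint64_t host_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical callback type: wraps whatever call is being measured. */
typedef void (*timed_call_fn)(void *user_data);

/* Host-side wall-clock time, in ns, spent inside one invocation of call. */
static uint64_t time_call_ns(timed_call_fn call, void *user_data)
{
    assert(call != NULL);
    uint64_t before = host_now_ns();
    call(user_data);
    uint64_t after = host_now_ns();
    return after - before;
}

/* Stand-in workload so the sketch runs without an OpenCL driver.
 * In real host code this callback would instead issue the enqueue call
 * with the queue, kernel and work sizes passed in through user_data. */
static void busy_work(void *user_data)
{
    volatile unsigned long sink = 0;
    unsigned long n = *(unsigned long *)user_data;
    for (unsigned long i = 0; i < n; ++i)
        sink += i;
}
```

Calling time_call_ns(busy_work, &iterations) returns the elapsed host time for that single call. Repeating the call and comparing the first result with later ones would show whether the first launch carries extra one-time cost.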
Regards,
Anthony