Hi,
I have an Odroid XU3 board, and I am trying to program the Mali-T628 GPU on this board with OpenCL.
Using the devices example that comes with the Mali SDK, I found that the Mali-T628 exposes two GPU devices: one with 4 compute units and another with 2 compute units. I was able to use these devices separately and found that the device with 2 compute units is slower than the device with 4 compute units. But I could not get them to run in parallel.
I created separate command queues for these devices and enqueued the kernels (assigning double the work to the larger device). Though the two kernels seem to be put in their queues immediately, the second kernel seems to start execution only after the first completes. From the profiling information, it seems that the kernels are being executed sequentially. Profiling information is given below. Note the queued time for the second kernel.
Profiling information:
Queued time: 0.334ms
Wait time: 21.751ms
Run time: 12246.8ms
Queued time: 12269.4ms
Wait time: 0.183916ms
Run time: 12494.5ms
Is this sequential execution expected?
Thanks in advance,
Kiran
Hi Kiran,
Your code is roughly doing:
    foreach device in devices:
        enqueue( queue[device], kernel )
    foreach device in devices:
        finish( queue[device] )
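The problem is that you're not flushing the queues, so the kernels are only sent to the GPU when clFinish is called, which is not what you want. Because clFinish blocks until its queue has drained, the second kernel is not submitted until the first one has completed.
If you add a clFlush after each enqueue, the jobs will be executed in parallel: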
    foreach device in devices:
        enqueue( queue[device], kernel )
        clFlush( queue[device] )
    foreach device in devices:
        finish( queue[device] )
You should then see something like:
Profiling information:
Queued time: 0.315959ms
Wait time: 14.0343ms
Run time: 15394.6ms
Profiling information:
Queued time: 14.0195ms
Wait time: 0.079417ms
Run time: 15331.2ms
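If you want to check the overlap directly rather than reading it off the printed durations, one option is to compare the start and end timestamps of the two kernel events. The sketch below is only an illustration: kernel_span and spans_overlap are made-up names, and it assumes both devices report their CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END values against the same clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Start and end of one kernel in nanoseconds, e.g. as read with
 * clGetEventProfilingInfo for CL_PROFILING_COMMAND_START and
 * CL_PROFILING_COMMAND_END (hypothetical container type). */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} kernel_span;

/* Two half-open intervals [start, end) overlap exactly when each one
 * starts before the other one ends. */
bool spans_overlap(kernel_span a, kernel_span b)
{
    return a.start_ns < b.end_ns && b.start_ns < a.end_ns;
}
```

With sequential execution the second kernel starts at or after the end of the first, so the function returns false; with the flush in place it should return true.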
Hope this helps,
Thanks a ton Anthony. Works fine with the flush.