Graphics, Gaming, and VR forum

OpenCL: Can the two GPU devices in Mali-T628 work in parallel

Kiran Chandramohan over 9 years ago

Hi,

I have an Odroid XU3 board, and I am trying to program the Mali-T628 GPU on this board with OpenCL.

From the devices example that comes with the Mali SDK, I learned that the Mali-T628 exposes two GPU devices: one with 4 compute units and another with 2 compute units. I was able to use these devices separately and found that the device with 2 compute units is slower than the device with 4 compute units. But I could not get them to run in parallel.

I created a separate command queue for each device and enqueued the kernels (assigning double the work to the larger device). Though the two kernels seem to be put in their queues immediately, the second kernel seems to start execution only after the first completes. From the profiling information, it seems that the kernels are getting executed sequentially. The profiling information is given below. Note the queued time for the second kernel.

Profiling information:

Queued time: 0.334ms

Wait time: 21.751ms

Run time: 12246.8ms

Profiling information:

Queued time: 12269.4ms

Wait time: 0.183916ms

Run time: 12494.5ms

Is this sequential execution expected?

Thanks in advance,
--Kiran
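
A note for readers interpreting the figures above. The sketch below assumes the three numbers are differences between consecutive OpenCL event profiling timestamps (QUEUED to SUBMIT, SUBMIT to START, START to END, all in nanoseconds), which is how such output is commonly derived; the thread itself does not show the printing code. The names profile_ns and ran_concurrently are hypothetical, and uint64_t stands in for cl_ulong so the snippet compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the four per-command timestamps (nanoseconds)
 * that clGetEventProfilingInfo reports: CL_PROFILING_COMMAND_QUEUED,
 * _SUBMIT, _START and _END. */
typedef struct {
    uint64_t queued, submit, start, end;
} profile_ns;

static double ns_to_ms(uint64_t ns) { return (double)ns / 1e6; }

/* Time the command sat in the host-side queue before being submitted. */
static double queued_ms(profile_ns p)
{
    assert(p.submit >= p.queued);
    return ns_to_ms(p.submit - p.queued);
}

/* Time between submission to the device and the start of execution. */
static double wait_ms(profile_ns p)
{
    assert(p.start >= p.submit);
    return ns_to_ms(p.start - p.submit);
}

/* Time the command actually spent executing. */
static double run_ms(profile_ns p)
{
    assert(p.end >= p.start);
    return ns_to_ms(p.end - p.start);
}

/* Nonzero if the two execution intervals [start, end) overlap.
 * Assumption: both devices report timestamps against the same clock,
 * which OpenCL does not guarantee in general. */
static int ran_concurrently(profile_ns a, profile_ns b)
{
    return a.start < b.end && b.start < a.end;
}

/* Print the three durations in the same layout as the output above. */
static void print_profile(profile_ns p)
{
    printf("Profiling information:\n");
    printf("Queued time: %gms\n", queued_ms(p));
    printf("Wait time: %gms\n", wait_ms(p));
    printf("Run time: %gms\n", run_ms(p));
}
```

With timestamps reconstructed from the figures above (assuming both kernels were enqueued at about t = 0), the second kernel's start falls after the first kernel's end, so ran_concurrently returns 0. A large queued time on the second kernel therefore means it was held on the host side rather than waiting on the device.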

Anthony Barbier over 9 years ago in reply to Kiran Chandramohan

Hi,
I can't see anything wrong with the part you pasted. Could you please paste the rest of the host code?
I'll try to run it tomorrow.
Thanks

Kiran Chandramohan over 9 years ago in reply to Anthony Barbier

Hi,
You can find the code in the link given below.
two_device
I tried my OpenCL code on an Nvidia Titan GPU with two devices and the kernels seem to be running in parallel on that GPU. Profiling information given below.
Profiling information:
Queued time:    0.002688ms
Wait time:      0.00464ms
Run time:       107.821ms
Profiling information:
Queued time:    0.005056ms
Wait time:      0.121632ms
Run time:       54.1882ms
--Kiran

Anthony Barbier over 9 years ago in reply to Kiran Chandramohan

Hi Kiran,
Your code is roughly doing:

foreach device in devices:
    enqueue( queue[device], kernel )

foreach device in devices:
    finish( queue[device] )

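The problem is that you're not flushing the queues, so the kernels only get sent to the GPU when clFinish is called. Because clFinish blocks until its queue has drained, the second kernel is not submitted until the first one has completed, which is not what you want.
If you add a clFlush after each enqueue, the jobs will get executed in parallel: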

foreach device in devices:
    enqueue( queue[device], kernel )
    clFlush( queue[device] )

foreach device in devices:
    finish( queue[device] )

You should then see something like:

Profiling information:

Queued time:    0.315959ms

Wait time:      14.0343ms

Run time:       15394.6ms

Profiling information:

Queued time:    14.0195ms

Wait time:      0.079417ms

Run time:       15331.2ms

Hope this helps,

Kiran Chandramohan over 9 years ago in reply to Anthony Barbier

Thanks a ton Anthony. Works fine with the flush.