Why do I get a clEnqueueMapBuffer() performance hit?

Mikael over 8 years ago

I'm currently porting vision algorithms to OpenCL, specifically targeting the Mali T800 GPUs. For this particular problem I'm running on the T-880 series.

I have several contiguous buffers of sizes 512x512 * (1, 2 and 4).

After the four clEnqueueNDRangeKernel() calls, I want to read the final result using clEnqueueMapBuffer().

What is weird is that if I increase the size of the output buffer and map / unmap it using clEnqueueMapBuffer() I get a severe performance hit.

Basically, I have a 1 ms total execution time that goes up to 6 ms if I increase the size of the output buffer by a factor of 10.

OBSERVE: nothing has changed between the two situations except the size of the output buffer.

The last kernel is pushing the results onto the output buffer using:

output[atom_inc(output_index)] = result;

I don't know how many elements will end up in the output buffer, only that there is an upper limit, which is why I want a relatively big output buffer. Say 512x512 * 8 bytes.
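The append pattern above can be mimicked on the host to show why only a prefix of the output buffer ever holds data. This is a minimal C11 sketch, not the kernel from the post: `append_result`, `valid_count`, `CAPACITY` and `result_t` are hypothetical names, the 8-byte element is taken from the size quoted above, and `atomic_fetch_add` stands in for OpenCL's `atom_inc`, which likewise returns the old counter value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on results; the post mentions 512x512 elements. */
#define CAPACITY (512u * 512u)

typedef uint64_t result_t;        /* assumed 8-byte element, per the post */

static result_t output[CAPACITY];
static atomic_uint output_index;  /* static storage, so it starts at zero */

/* Claim the next free slot and store the result there.
   Returns 0 on success, -1 if the buffer is already full. */
static int append_result(result_t result)
{
    /* Like atom_inc, this returns the value before the increment. */
    unsigned slot = atomic_fetch_add(&output_index, 1u);
    if (slot >= CAPACITY)
        return -1;
    output[slot] = result;
    return 0;
}

/* Number of valid elements, clamped to the buffer capacity. */
static size_t valid_count(void)
{
    unsigned n = atomic_load(&output_index);
    return n < CAPACITY ? n : CAPACITY;
}
```

After the last kernel finishes, the counter says how many of the elements are meaningful, so the consumer does not need to look at the full capacity.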

Has anyone encountered something similar? And what could be the cause of this?

+1 Peter Harris over 8 years ago

In addition to setting up a CPU-side mapping (may require MMU operations if the mapping doesn't already exist), the driver has to make sure that the output buffer is memory coherent with the GPU (i.e. no dirty lines in the CPU cache) before the job starts, and memory coherent after the job finishes (i.e. no stale lines in the CPU cache which may contain old data).
In a system without hardware IO coherency (possible with Mali, but depends on system integration) we need to do manual cache maintenance operations to force these steps to happen. Bigger buffer = longer cache maintenance operations.
Pete
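To put rough numbers on "bigger buffer = longer cache maintenance", the sketch below counts how many cache lines a range covers and how many bytes actually hold results. It is only an illustration under stated assumptions: the 64-byte line size is a guess that varies by platform, and `bytes_to_map` and `cache_lines_touched` are hypothetical helpers, not part of OpenCL. `clEnqueueMapBuffer()` does take `offset` and `size` arguments, so a value like the one returned by `bytes_to_map` could be passed as `size`, but whether a given Mali driver limits its maintenance to the mapped range rather than the whole allocation is not something the thread confirms and would need to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed CPU cache line size in bytes; platform-specific in practice. */
#define CACHE_LINE_BYTES 64u

/* Bytes that actually hold results: count elements, clamped to the buffer. */
static size_t bytes_to_map(size_t count, size_t elem_size, size_t capacity_bytes)
{
    assert(elem_size > 0);
    size_t max_elems = capacity_bytes / elem_size;
    if (count > max_elems)
        count = max_elems;
    return count * elem_size;
}

/* Cache lines covered by a range that starts on a line boundary. */
static size_t cache_lines_touched(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

With these assumptions, a full 512x512 * 8 byte buffer spans 32768 lines, while 1000 results of 8 bytes each span 125.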