Why do I get a performance hit from clEnqueueMapBuffer()?

I'm currently porting vision algorithms to OpenCL, specifically targeting the Mali-T800 series GPUs. For this particular problem I'm running on a Mali-T880.

I have several contiguous buffers of sizes 512x512 * (1, 2 and 4).

After the four clEnqueueNDRangeKernel() calls, I want to read the final result using clEnqueueMapBuffer().

What is weird is that if I increase the size of the output buffer and map/unmap it using clEnqueueMapBuffer() and clEnqueueUnmapMemObject(), I get a severe performance hit.

Basically, I have a 1 ms total execution time that goes up to 6 ms if I increase the size of the output buffer by a factor of 10.

Note: nothing has changed between the two cases except the size of the output buffer.

The last kernel pushes its results onto the output buffer using:

output[atom_inc(output_index)] = result;
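
For readers unfamiliar with this append pattern, here is a minimal host-side sketch in plain C using C11 atomics. It is only an analogy for the kernel line above, not my actual kernel: `append_buffer`, `append_result` and `append_buffer_used` are hypothetical names, and the capacity check is an addition that the one-line kernel statement does not have. Like `atom_inc`, `atomic_fetch_add` returns the value the counter held before the increment, so each caller gets a unique slot.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for the kernel's output buffer. */
typedef struct {
    int *data;          /* storage for the results */
    size_t capacity;    /* known upper limit on the number of results */
    atomic_uint count;  /* plays the role of output_index in the kernel */
} append_buffer;

void append_buffer_init(append_buffer *b, int *storage, size_t capacity)
{
    b->data = storage;
    b->capacity = capacity;
    atomic_init(&b->count, 0u);
}

/* Reserve one slot atomically and store the result there.
 * Returns 1 if the value was stored, 0 if the reserved slot was
 * beyond the capacity (assumed check, not in the original kernel). */
int append_result(append_buffer *b, int result)
{
    /* Returns the old counter value, like atom_inc() in OpenCL C. */
    unsigned int slot = atomic_fetch_add(&b->count, 1u);
    if (slot >= b->capacity) {
        return 0;
    }
    b->data[slot] = result;
    return 1;
}

/* Number of valid elements: the counter clamped to the capacity. */
size_t append_buffer_used(append_buffer *b)
{
    unsigned int n = atomic_load(&b->count);
    return n < b->capacity ? (size_t)n : b->capacity;
}
```

The point of the sketch is that the final counter value is the only way to know how many elements of the (deliberately oversized) buffer are actually valid after the kernel has run.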

I don't know how many elements will end up in the output buffer, only that there is an upper limit, which is why I want a relatively big output buffer, say 512x512 * 8 bytes.

Has anyone encountered something similar? What could be the cause?
