Why do I get a performance hit from clEnqueueMapBuffer()?

I'm currently porting vision algorithms to OpenCL, specifically targeting the Mali T800 series GPUs. For this particular problem I'm running on a Mali-T880.

I have several contiguous buffers of sizes 512x512 * (1, 2 and 4).

After the four clEnqueueNDRangeKernel() calls, I want to read the final result using clEnqueueMapBuffer().

What is weird is that if I increase the size of the output buffer and map/unmap it using clEnqueueMapBuffer(), I get a severe performance hit.

Basically, I have a 1 ms total execution time that goes up to 6 ms if I increase the size of the output buffer by a factor of 10.

OBSERVE: nothing has changed between the two situations except the size of the output buffer.

The last kernel is pushing the results onto the output buffer using:

output[atom_inc(output_index)] = result;

I don't know how many elements will end up in the output buffer, only that there is an upper limit, which is why I want a relatively big output buffer, say 512x512 * 8 bytes.
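
For readers unfamiliar with the append pattern described above, here is a minimal host-side model of it in plain C11. It is a sketch for reasoning about the pattern, not the actual kernel: `append_result`, `valid_count` and `capacity` are hypothetical names, the 8-byte element type is assumed from the buffer size mentioned above, and `atomic_fetch_add` stands in for `atom_inc` (both return the counter value from before the increment).

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of the kernel's append step.
 * atomic_fetch_add plays the role of atom_inc: it returns the old
 * counter value, which becomes the slot index for this result. */
static int append_result(uint64_t *output, atomic_uint *output_index,
                         unsigned capacity, uint64_t result)
{
    unsigned slot = atomic_fetch_add(output_index, 1u);
    if (slot >= capacity) {
        return 0;  /* buffer full: result dropped, counter still advanced */
    }
    output[slot] = result;
    return 1;
}

/* Number of valid elements after all appends, clamped to capacity
 * because the counter keeps counting even when appends are dropped. */
static unsigned valid_count(atomic_uint *output_index, unsigned capacity)
{
    unsigned n = atomic_load(output_index);
    return n < capacity ? n : capacity;
}
```

The point of the model is that the counter, not the buffer size, tells the host how many elements are valid once the kernels have finished.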

Has anyone encountered something similar? And what could be the cause of this?

  • In addition to setting up a CPU-side mapping (may require MMU operations if the mapping doesn't already exist), the driver has to make sure that the output buffer is memory coherent with the GPU (i.e. no dirty lines in the CPU cache) before the job starts, and memory coherent after the job finishes (i.e. no stale lines in the CPU cache which may contain old data).

    In a system without hardware IO coherency (possible with Mali, but depends on system integration) we need to do manual cache maintenance operations to force these steps to happen. Bigger buffer = longer cache maintenance operations.

    Pete
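
If the cost really does scale with the size of the mapped region, one thing that may be worth trying is mapping only the bytes the kernels actually wrote, rather than the whole output buffer. This is an assumption and not something confirmed in this thread: whether a given driver limits its cache maintenance to the mapped sub-range is implementation-dependent, so it would need to be measured. The sketch below only computes the size to request; `mapped_bytes` is a hypothetical helper, and its result would be passed as the `size` argument of clEnqueueMapBuffer() with an `offset` of 0.

```c
#include <stddef.h>

/* Hypothetical helper: byte count to request from clEnqueueMapBuffer()
 * when only the first `count` elements of the output were written.
 * `count` would come from reading back the small counter buffer first. */
static size_t mapped_bytes(size_t count, size_t capacity, size_t elem_size)
{
    if (count > capacity) {
        count = capacity;  /* the counter may overshoot the buffer */
    }
    return count * elem_size;
}

/* Intended use (not compiled here, needs an OpenCL context):
 *   size_t n = mapped_bytes(count, capacity, sizeof(cl_ulong));
 *   if (n > 0)
 *       ptr = clEnqueueMapBuffer(queue, output, CL_TRUE, CL_MAP_READ,
 *                                0, n, 0, NULL, NULL, &err);
 */
```

The `n > 0` check matters because clEnqueueMapBuffer() rejects a zero-sized region.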
