I'm currently porting vision algorithms to OpenCL, specifically targeting the Mali T800 GPUs. For this particular problem I'm running on the T-880 series.
I have several contiguous buffers of sizes 512x512 * (1, 2 and 4).
After the four clEnqueueNDRangeKernel() calls, I want to read the final result using clEnqueueMapBuffer().
What is weird is that if I increase the size of the output buffer and map / unmap it using clEnqueueMapBuffer(), I get a severe performance hit.
Basically, I have a 1 ms total execution time that goes up to 6 ms if I increase the size of the output buffer by a factor of 10.
Note that nothing has changed between the two situations except the size of the output buffer.
The last kernel is pushing the results onto the output buffer using:
output[atom_inc(output_index)] = result;
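To make the append pattern above concrete, here is a minimal host-side model of it in plain C11. It is only a sketch, not the actual kernel: the `append_buffer` struct, the `append_result` name, the `float` element type, and the capacity check are all assumptions made for illustration. It relies on `atomic_fetch_add` returning the previous value, which is also what `atom_inc` does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the output buffer; the float element type is an assumption. */
typedef struct {
    float *data;        /* storage for appended results */
    size_t capacity;    /* upper limit on the number of elements */
    atomic_uint count;  /* plays the role of *output_index */
} append_buffer;

/* Reserve a slot atomically and store the result in it.
   Returns 1 on success, 0 if the reserved slot is past the upper limit.
   Note that count keeps growing even when the element is dropped. */
int append_result(append_buffer *buf, float result)
{
    assert(buf != NULL && buf->data != NULL);

    /* atomic_fetch_add returns the old value, like atom_inc in the kernel. */
    unsigned slot = atomic_fetch_add(&buf->count, 1u);
    if (slot >= buf->capacity)
        return 0;

    buf->data[slot] = result;
    return 1;
}
```

After all appends, `count` holds the number of attempted writes, so the number of valid elements is the smaller of `count` and `capacity`.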
I don't know how many elements will end up in the output buffer, only that there is an upper limit, which is why I want a relatively big output buffer, say 512x512 * 8 bytes.
Has anyone encountered something similar? And what could be the cause of this?
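For reference, the sizing arithmetic can be sketched as below. This is an illustration under assumptions, not a confirmed fix: the `bytes_to_map` helper is hypothetical, and it assumes the final value of the append counter has already been read back to the host. The returned value could then be passed as the `size` argument of clEnqueueMapBuffer() so that only the filled part of the buffer is mapped. Whether that changes the timing on Mali is something I have not verified.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes that actually hold results.
   element_count     - final value of the append counter (assumed known on the host)
   capacity_elements - upper limit on elements, e.g. 512 * 512
   element_bytes     - size of one element, e.g. 8 (an assumption) */
size_t bytes_to_map(size_t element_count, size_t capacity_elements, size_t element_bytes)
{
    assert(element_bytes > 0);

    /* The counter may overshoot the upper limit, so clamp it. */
    if (element_count > capacity_elements)
        element_count = capacity_elements;

    return element_count * element_bytes;
}
```

With the numbers above, the full buffer is 512 * 512 * 8 = 2,097,152 bytes (2 MiB), while 1,000 appended elements would need only 8,000 bytes mapped.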