Hi,
I'm porting an existing OpenCL application to ARM/Mali. It already runs, but performance could be better because of unneeded buffer copies.
The ideal OpenCL workflow seems to be (rough sketch below):
Init: create a cl_mem object with CL_MEM_ALLOC_HOST_PTR.
Loop:
1. Get a void* via clEnqueueMapBuffer.
2. Use the void* to fill in the data.
3. Unmap.
4. Use the cl_mem object as a kernel argument.
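Roughly, the loop I have in mind looks like this (just a sketch; a pre-existing context, queue, kernel and input data are assumed, and most error handling is dropped):

#include <CL/cl.h>
#include <string.h>

void run_loop(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
              const void *src, size_t size, int iterations)
{
    cl_int err = CL_SUCCESS;

    /* Init: let the driver allocate host-accessible memory (zero-copy on Mali). */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);

    for (int i = 0; i < iterations; ++i) {
        /* 1. Map the buffer to get a CPU-visible pointer. */
        void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, &err);

        /* 2. Fill in the data on the CPU. */
        memcpy(ptr, src, size);

        /* 3. Unmap so the GPU sees the data (cache maintenance happens here). */
        clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);

        /* 4. Use the cl_mem object as a kernel argument. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        size_t gws = size;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    }

    clFinish(queue);
    clReleaseMemObject(buf);
}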
But I have a problem with step 1: do I always get the same void*, or can this pointer move over time? I basically need this (sketched in code after the list):
Init 1: create a cl_mem object with CL_MEM_ALLOC_HOST_PTR.
Init 2: get a void* via clEnqueueMapBuffer.
Init 3: pass the void* into a device driver (a very expensive operation).
Loop:
1. Wait for the device driver to fill the buffer.
2. Unmap.
3. Use the cl_mem object as a kernel argument.
4. clEnqueueMapBuffer again, but ignore the new void*, because redoing Init 3 is very slow.
Can I really ignore the pointer returned in step 4 and keep using the pointer obtained in Init 2 (and passed to the driver in Init 3) forever? Is this guaranteed for all Mali OpenCL implementations?
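To make it concrete, here is roughly what I want to write. driver_attach() and driver_wait() are just placeholders for my device driver API (the attach step is the expensive one), and error handling is omitted:

#include <CL/cl.h>
#include <assert.h>
#include <stddef.h>

/* Placeholders for my device driver API; re-running driver_attach()
 * (the "Init 3" step) is the expensive part I want to avoid. */
extern void driver_attach(void *dst, size_t size);
extern void driver_wait(void);

void capture_loop(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                  size_t size, int frames)
{
    cl_int err = CL_SUCCESS;

    /* Init 1: driver-allocated buffer for zero-copy access. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);

    /* Init 2: get the CPU-visible pointer. */
    void *initial = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, &err);

    /* Init 3: point the device driver at this address (expensive). */
    driver_attach(initial, size);

    void *mapped = initial;
    for (int i = 0; i < frames; ++i) {
        /* 1. Wait for the device driver to fill the buffer. */
        driver_wait();

        /* 2. Unmap so the GPU sees the new data. */
        clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);

        /* 3. Use the cl_mem object as a kernel argument. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        size_t gws = size;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

        /* 4. Re-map. I want to ignore the returned pointer, i.e. this
         * assert should never fire -- but is that guaranteed? */
        mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, &err);
        assert(mapped == initial);
    }

    clFinish(queue);
    clReleaseMemObject(buf);
}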
Thanks
If what you ultimately want to do is process an image from a camera driver, it should be possible to achieve zero-copy, provided the camera driver supports dma_buf (which includes Android ION allocations). The Mali driver supports a proprietary extension (https://www.khronos.org/registry/OpenCL/extensions/arm/cl_arm_import_memory.txt) that allows you to create a CL buffer wrapping a dma_buf allocation.
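Not a definitive implementation, but a minimal sketch of the import, assuming CL/cl_ext.h declares clImportMemoryARM and the related enums (otherwise you can fetch the function with clGetExtensionFunctionAddressForPlatform), and that dma_buf_fd is a valid file descriptor exported by the camera driver:

#include <CL/cl.h>
#include <CL/cl_ext.h>

cl_mem import_dma_buf(cl_context ctx, int dma_buf_fd, size_t size)
{
    cl_int err = CL_SUCCESS;

    /* Property list: import type dma_buf, terminated by 0. */
    const cl_import_properties_arm props[] = {
        CL_IMPORT_TYPE_ARM, CL_IMPORT_TYPE_DMA_BUF_ARM,
        0
    };

    /* For dma_buf imports, the memory argument points to the file descriptor. */
    cl_mem buf = clImportMemoryARM(ctx, CL_MEM_READ_WRITE, props,
                                   &dma_buf_fd, size, &err);
    return (err == CL_SUCCESS) ? buf : NULL;
}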
L.Gilson said: The cache maintenance is done on map/unmap of the buffer, right?
Correct. As Pete is saying, there is no way around the CPU cache maintenance cost as long as the buffer is touched by the CPU on a platform without CPU/GPU cache coherency. However, if your platform supports IO-coherency (and the buffer is allocated with CL_MEM_ALLOC_HOST_PTR), some cache maintenance can be avoided. Similarly, if the platform supports full hardware coherency and OpenCL 2.0, using fine-grain SVM allocations is likely to provide a performance benefit, as no cache maintenance at all should be required.
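For illustration only, a rough sketch of the fine-grain SVM path, assuming an OpenCL 2.0 platform that reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER support; note that no map/unmap (and hence no cache maintenance) is involved:

#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <string.h>

void run_with_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                  const void *src, size_t size)
{
    /* Allocate fine-grain SVM memory, usable by CPU and GPU concurrently. */
    void *svm = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                           size, 0);

    /* The CPU writes the allocation directly; no clEnqueueMapBuffer needed. */
    memcpy(svm, src, size);

    /* Pass the SVM pointer straight to the kernel. */
    clSetKernelArgSVMPointer(kernel, 0, svm);
    size_t gws = size;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(ctx, svm);
}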