OpenCL on Mali, MapBuffers and lifetime of void* pointers

Hi,

I'm porting an existing OpenCL application to ARM/Mali. It already runs, but performance is held back by unnecessary buffer copies.

The ideal OpenCL workflow seems to be:

Init: create a cl_mem object with CL_MEM_ALLOC_HOST_PTR.

Loop:

1. Get a void* via clEnqueueMapBuffer.

2. Use the void* to fill in data.

3. Unmap the buffer.

4. Use the cl_mem object as a kernel argument.

But I have a problem in step 1: do I always get the same void*, or can this pointer change over time? I basically need this:

Init 1: create a cl_mem object with CL_MEM_ALLOC_HOST_PTR.

Init 2: get a void* via clEnqueueMapBuffer.

Init 3: pass the void* to a device driver (a very expensive operation).

Loop:

1. Wait for the device driver to fill the buffer.

2. Unmap the buffer.

3. Use the cl_mem object as a kernel argument.

4. Call clEnqueueMapBuffer again, but ignore the returned void*, because repeating Init 3 is very slow.

Can I really ignore the pointer returned in step 4 and keep using the pointer obtained in Init 2 (and registered with the driver in Init 3) forever? Is this guaranteed for all Mali OpenCL implementations?

Thanks
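
A defensive variant of step 4, as a minimal sketch: instead of ignoring the pointer outright, compare it with the one already registered with the driver and repeat the expensive Init 3 only if the address actually changed. This assumes nothing about Mali's behaviour. `ptr_guard` and `ptr_guard_update` are hypothetical names, not part of OpenCL, and the `mapped` argument stands in for the value returned by clEnqueueMapBuffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API): remembers the host pointer that
 * was handed to the producer driver in Init 3. */
struct ptr_guard {
    void *registered;          /* pointer last given to the driver, or NULL */
    unsigned reregistrations;  /* how often the expensive path was needed   */
};

/* Call with the pointer returned by every clEnqueueMapBuffer.
 * Returns 1 if the driver must be (re)registered with `mapped`,
 * 0 if the cached registration still matches. */
static int ptr_guard_update(struct ptr_guard *g, void *mapped)
{
    assert(g != NULL && mapped != NULL);
    if (g->registered == mapped)
        return 0;              /* same address: skip the expensive Init 3 */
    g->registered = mapped;
    g->reregistrations++;
    return 1;                  /* address changed (or first map) */
}
```

In the loop, the pointer from step 4 would be passed to `ptr_guard_update`; only a return value of 1 means the driver has to be given the new address. If the pointer never moves, the expensive path runs exactly once.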

  • Thanks a lot for the quick answer.

    I'm still wondering whether it is possible to avoid at least one expensive operation: resetting the buffer pointer in the device driver that produces the data (a camera driver). This operation is expensive (~50 ms) because the camera needs to be stopped and restarted.

    The cache maintenance is done on map/unmap buffer, right? Assuming the pointer is valid for the lifetime of the cl_mem object, map and unmap calls should only affect the CPU/GPU caches. That implies I don't have to pass the pointer to the device driver every time I call map, which means 50 ms less to worry about.

    I already have a pipeline set up with all cores (currently 8) running. The code can handle multiple GPUs, each running multiple queues. But it is not zero-copy code.
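
Even if the pointer stays valid, the producer should only write while the buffer is mapped, and the kernel should only run while it is unmapped. A minimal sketch of that hand-off as a host-side bookkeeping struct follows. All names are hypothetical, and it is simplified to a single outstanding map per buffer. The real transitions would be driven by a blocking clEnqueueMapBuffer and by clEnqueueUnmapMemObject.

```c
#include <assert.h>
#include <stddef.h>

/* Who may touch the buffer contents right now (hypothetical bookkeeping). */
enum buf_owner { OWNER_DEVICE, OWNER_HOST };

struct shared_buf {
    enum buf_owner owner;
};

/* Record that a blocking clEnqueueMapBuffer has returned.
 * Returns 0 on success, -1 if this sketch already considers it mapped. */
static int shared_buf_mapped(struct shared_buf *b)
{
    assert(b != NULL);
    if (b->owner == OWNER_HOST)
        return -1;
    b->owner = OWNER_HOST;     /* camera driver / CPU may now write */
    return 0;
}

/* Record that clEnqueueUnmapMemObject has been issued.
 * Returns 0 on success, -1 if the buffer was not mapped. */
static int shared_buf_unmapped(struct shared_buf *b)
{
    assert(b != NULL);
    if (b->owner == OWNER_DEVICE)
        return -1;
    b->owner = OWNER_DEVICE;   /* kernels may now use the cl_mem */
    return 0;
}

static int shared_buf_producer_may_write(const struct shared_buf *b)
{
    return b->owner == OWNER_HOST;
}

static int shared_buf_kernel_may_run(const struct shared_buf *b)
{
    return b->owner == OWNER_DEVICE;
}
```

With one such struct per buffer, the camera thread would check `shared_buf_producer_may_write` before filling a frame, and the dispatch thread would check `shared_buf_kernel_may_run` before enqueueing the kernel.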

  • If what you ultimately want to do is process an image from a camera driver, it should be possible to achieve zero-copy, provided the camera driver supports dma_buf (which includes Android ION allocations). The Mali driver supports a proprietary extension (https://www.khronos.org/registry/OpenCL/extensions/arm/cl_arm_import_memory.txt) that lets you create a CL buffer wrapping a dma_buf allocation.

    The cache maintenance is done on map/unmap buffer, right?

    Correct. As Pete says, there is no way around the CPU cache maintenance cost as long as the buffer is touched by the CPU on a platform without CPU/GPU cache coherency. However, if your platform supports I/O coherency (enabled by CL_MEM_ALLOC_HOST_PTR), some cache maintenance can be avoided. Similarly, if the platform supports full hardware coherency and OpenCL 2.0, using fine-grain SVM allocations is likely to provide a performance benefit, as no cache maintenance at all should be required.