OpenCL on Mali, MapBuffers and lifetime of void* pointers

Hi,

I'm porting an existing application that uses OpenCL to ARM/Mali. It already runs, but performance could be better because of unnecessary buffer copies.

The ideal OpenCL workflow seems to be:

Init: create a cl_mem object with CL_MEM_ALLOC_HOST_PTR .

Loop:

1. get a void* via clEnqueueMapBuffer

2. Use the void* to fill in data

3. Unmap the buffer

4. Use the cl_mem object as a parameter for a kernel.

But I have a problem in step 1: do I always get the same void*, or can this pointer change over time? I basically need this:

Init 1: create a cl_mem object with CL_MEM_ALLOC_HOST_PTR .

Init 2: get a void* via clEnqueueMapBuffer

Init 3: pass the void* into a device driver (very expensive operation)

Loop:

1. wait for the device driver to fill the buffer

2. Unmap the buffer

3. Use the cl_mem object as a parameter for a kernel.

4. Call clEnqueueMapBuffer again, but ignore the new void*, because repeating Init 3 is very slow.

Can I really ignore the pointer returned in step 4 and keep using the first pointer (returned in Init 2 and passed to the driver in Init 3) forever? Is this guaranteed for all Mali OpenCL implementations?

Thanks

  • Passing data from the host to the GPU is expensive primarily because of the need for CPU-side cache maintenance on platforms without some form of hardware cache coherency between the GPU and the CPU. This cost is going to be unavoidable even if the actual pointer is the same - if you don't flush the CPU caches on data exchange then you risk data corruption because the CPU and GPU views of the data are out of synchronization.

    The best workaround here is to pipeline the GPU processing and the CPU processing. Have two (or more) buffers, and while the CPU is processing and setting up one, have the GPU processing another.

    HTH, 
    Pete
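
As a rough illustration of the two-buffer pipelining suggested above, here is a minimal sketch in plain C of the slot bookkeeping only. The names (`pingpong`, `pingpong_begin_fill`, and so on) are hypothetical and are not part of OpenCL or the Mali driver. In a real application the map, unmap and kernel-enqueue calls would sit where the comments indicate.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 2  /* two buffers: one being filled, one being processed */

typedef enum { SLOT_FREE, SLOT_CPU_FILLING, SLOT_READY, SLOT_GPU_BUSY } slot_state;

typedef struct {
    slot_state state[NUM_SLOTS];
    size_t next_fill;    /* slot the CPU will fill next */
    size_t next_submit;  /* slot that will be handed to the GPU next */
} pingpong;

void pingpong_init(pingpong *p) {
    for (size_t i = 0; i < NUM_SLOTS; ++i) p->state[i] = SLOT_FREE;
    p->next_fill = 0;
    p->next_submit = 0;
}

/* CPU side: claim the next slot for filling (the buffer would be mapped here).
   Returns the slot index, or -1 if that slot is still in use. */
int pingpong_begin_fill(pingpong *p) {
    size_t s = p->next_fill;
    if (p->state[s] != SLOT_FREE) return -1;
    p->state[s] = SLOT_CPU_FILLING;
    p->next_fill = (s + 1) % NUM_SLOTS;
    return (int)s;
}

/* CPU side: filling is finished (the buffer would be unmapped here). */
void pingpong_end_fill(pingpong *p, int slot) {
    assert(slot >= 0 && slot < NUM_SLOTS);
    assert(p->state[slot] == SLOT_CPU_FILLING);
    p->state[slot] = SLOT_READY;
}

/* Hand the oldest filled slot to the GPU (the kernel would be enqueued here).
   Returns the slot index, or -1 if nothing is ready yet. */
int pingpong_submit(pingpong *p) {
    size_t s = p->next_submit;
    if (p->state[s] != SLOT_READY) return -1;
    p->state[s] = SLOT_GPU_BUSY;
    p->next_submit = (s + 1) % NUM_SLOTS;
    return (int)s;
}

/* GPU side: the kernel using this slot has completed, so it can be refilled. */
void pingpong_complete(pingpong *p, int slot) {
    assert(slot >= 0 && slot < NUM_SLOTS);
    assert(p->state[slot] == SLOT_GPU_BUSY);
    p->state[slot] = SLOT_FREE;
}
```

With two slots, the CPU can start filling slot 1 as soon as slot 0 has been submitted, and it only stalls when both slots are occupied.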

  • Thanks a lot for the quick answer.

    I'm still wondering whether it is possible to avoid at least one expensive operation: resetting the buffer pointer for the data-producing device driver (a camera driver). This operation is expensive (~50 ms) because the camera needs to be stopped and restarted.

    The cache maintenance is done on map/unmap buffer, right? Assuming the pointer is valid for the lifetime of the cl_mem object, map and unmap calls should only affect the CPU/GPU caches. That would mean I don't have to pass the pointer to the device driver every time I call map, so 50 ms less to worry about.

    I already have a pipeline set up with all cores (currently 8) running. The code can handle multiple GPUs, each running multiple queues. But it is not zero-copy code.
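
The thread does not establish that the mapped pointer stays the same across map calls, so one defensive option is to compare the pointer returned by each clEnqueueMapBuffer call with the one the camera driver already holds, and to pay for the restart only when they differ. Below is a minimal sketch in plain C. `mapped_ptr_guard`, `guard_on_map` and the `count_registration` hook are hypothetical names, and the hook merely stands in for the expensive driver call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook: the expensive call that hands a buffer address
   to the camera driver (stop camera, set pointer, restart). */
typedef void (*register_fn)(void *ptr, void *user);

typedef struct {
    void *registered;  /* address the driver currently holds, NULL if none */
    register_fn reg;   /* expensive registration hook */
    void *user;        /* opaque argument forwarded to the hook */
} mapped_ptr_guard;

void guard_init(mapped_ptr_guard *g, register_fn reg, void *user) {
    g->registered = NULL;
    g->reg = reg;
    g->user = user;
}

/* Call after every map with the pointer that the map call returned.
   Returns 1 if the driver had to be re-registered, 0 if the cached
   pointer was still the same and the expensive call was skipped. */
int guard_on_map(mapped_ptr_guard *g, void *mapped) {
    assert(mapped != NULL);
    if (mapped == g->registered) return 0;
    g->reg(mapped, g->user);
    g->registered = mapped;
    return 1;
}

/* Example hook for illustration: counts how often registration happened. */
void count_registration(void *ptr, void *user) {
    (void)ptr;
    ++*(int *)user;
}
```

If the implementation does return the same address every time, the hook runs once during initialisation and never again. If the address ever moves, the driver is updated instead of writing to a stale pointer.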

  • If what you ultimately want to do is process an image from a camera driver, it should be possible to achieve zero-copy, provided the camera driver supports dma_buf (which includes Android Ion allocations). The Mali driver supports a proprietary extension (https://www.khronos.org/registry/OpenCL/extensions/arm/cl_arm_import_memory.txt) that lets you create a CL buffer that wraps a dma_buf allocation.

    "The cache maintenance is done on map/unmap buffer, right?"

    Correct. As Pete says, there is no way around the CPU cache maintenance cost as long as the buffer is touched by the CPU on a platform without CPU/GPU cache coherency. However, if your platform supports I/O coherency (enabled by CL_MEM_ALLOC_HOST_PTR), some cache maintenance can be avoided. Similarly, if the platform supports full hardware coherency and OpenCL 2.0, using fine-grain SVM allocations is likely to provide a performance benefit, as no cache maintenance at all should be required.