I wish to allocate a vector and use its data pointer to create a zero-copy buffer on the GPU. There is the cl_arm_import_memory extension, which can be used to do this, but I am not sure whether it is supported by all Mali Midgard OpenCL drivers.
I was going through this link and I am quite puzzled by the following lines:
If the extension string cl_arm_import_memory_host is exposed then importing
from normal userspace allocations (such as those created via malloc) is
supported.
What exactly do these lines mean? I am specifically working on Rockchip's RK3399 boards. Kindly help.
Well, I queried the compute device using CL_DEVICE_EXTENSIONS and I saw only the cl_arm_import_memory string, not the cl_arm_import_memory_host string. What does this mean? Can I still use the extension function? Another possible workaround might be to first allocate a GPU-side buffer using CL_MEM_ALLOC_HOST_PTR and then write a custom allocator that maps the GPU buffer pointer to a vector. But how safe is this? Will it work?
I think the driver you have is incorrectly missing the cl_arm_import_memory_host extension string (there are unfortunately some drivers in the field with that issue, sorry for that). To be 100% sure you can try to run a small test that uses CL_IMPORT_TYPE_HOST_ARM. You will get CL_INVALID_PROPERTY if the feature is not supported.
In case the feature is not supported, you can allocate a buffer with CL_MEM_ALLOC_HOST_PTR and map it without a copy. I don't think you can safely use std::vector to interface with your application but you can probably write a custom container that would provide a suitable interface to your application and use the mapped pointer without a copy. It's hard to say precisely what is best though without detailed requirements.
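For completeness: if you do check the extension string, compare whole tokens rather than searching for a substring, since cl_arm_import_memory is a prefix of cl_arm_import_memory_host and a plain substring search would report a false match. A minimal sketch, assuming the input is the space-separated string returned for CL_DEVICE_EXTENSIONS (has_extension is a hypothetical helper name, not part of OpenCL):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns true only if `name` appears as a whole
// space-separated token in `extensions`.
bool has_extension(const std::string& extensions, const std::string& name) {
    assert(!name.empty());  // an empty name can never be a valid token
    std::istringstream tokens(extensions);
    std::string token;
    while (tokens >> token) {
        if (token == name) {
            return true;
        }
    }
    return false;
}
```

Keep in mind that on the affected drivers a missing cl_arm_import_memory_host token is not conclusive, so the small import test described above remains the more reliable check.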
Hi Kevin. Thanks for the reply. I used the clImportMemoryARM function; as written in the documentation, the default import type is CL_IMPORT_TYPE_HOST_ARM. Although the function returned successfully, when I mapped the buffer using clEnqueueMapBuffer, the pointer it returned differed from the original address I had passed to clImportMemoryARM. I am not sure what's happening. My main goal is to use the same physical memory for a std::vector and a cl::Buffer. The way I am planning to do that is by writing a custom allocator for the std::vector that forces it to point to the memory of a cl::Buffer allocated with CL_MEM_ALLOC_HOST_PTR. But the issue is that, according to the Mali OpenCL guide, the mapped address of a cl::Buffer can differ between invocations. So I am at a loss as to how to maintain consistency between a vector and a cl::Buffer.
Right, calling clEnqueueMapBuffer on an imported buffer is forbidden by the spec but some older versions of the driver didn't reject the call. If the clImportMemoryARM function returned successfully, then your platform has support and I suggest you do the following:
- Use a custom allocator with your std::vectors that guarantees that memory is aligned to 64 bytes (cache line alignment, as per the extension specification).
- Import the pointer returned by vec.data() using clImportMemoryARM with CL_IMPORT_TYPE_HOST_ARM.
This obviously only works if the allocation backing the vector doesn't change after the import into OpenCL, so you need to reserve enough space upfront (e.g. push_back may end up reallocating). You don't have to do anything to maintain data consistency; it's all managed by the OpenCL runtime.
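The allocator step could look roughly like the following. This is a minimal sketch assuming C++17 (for aligned operator new); AlignedAllocator, is_aligned, ImportVector and make_reserved are hypothetical names used for illustration and are not part of OpenCL or the extension. The clImportMemoryARM call itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator that hands out storage aligned to `Alignment` bytes
// (64 by default, the cache line alignment mentioned above).
template <typename T, std::size_t Alignment = 64>
struct AlignedAllocator {
    using value_type = T;
    static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");
    static_assert(Alignment >= alignof(T), "Alignment must not be weaker than alignof(T)");

    // Needed explicitly because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) {
            throw std::bad_array_new_length();
        }
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }

    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Returns true if `p` is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

using ImportVector = std::vector<float, AlignedAllocator<float>>;

// Reserves the full capacity upfront so that later push_back calls, up to
// `capacity` elements, do not reallocate and invalidate the imported pointer.
inline ImportVector make_reserved(std::size_t capacity) {
    ImportVector v;
    v.reserve(capacity);
    assert(is_aligned(v.data(), 64));
    return v;
}
```

With a vector created this way, v.data() stays at the same address as long as the size never exceeds the reserved capacity, so that pointer is the one to pass to the import call.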
Thanks Kevin for the prompt reply. Really appreciate it.
Hi, I am seeing a performance difference between allocating a cl_mem using cl_arm_import_memory and allocating it using CL_MEM_ALLOC_HOST_PTR. The kernel execution time decreases by 10% when the buffer is allocated by passing the CL_MEM_ALLOC_HOST_PTR flag to clCreateBuffer(). Is this the expected behaviour, and is there any workaround for it?
This is expected behaviour. What you are likely measuring (I can confirm if you tell me exactly how you're measuring this) is the cost of maintaining data consistency between the CPU and GPU.
Conceptually, running a kernel on imported host memory has roughly the same cost as unmapping a buffer, running the kernel and mapping the buffer on the CPU again.
You can reduce that cost to a minimum by batching kernels into as few flush groups as possible. Later drivers are better at this.