I want to use clImportMemoryARM API to achieve zero copy between CPU and GPU.
However, the performance is not what I expected. For an FHD image, the import takes 4.4 ms, almost identical to an explicit upload.
Is this slow performance expected? I am using a Mali-G72 GPU.
Thanks,
-Shouwen
Hi Shouwen,
Thanks for the feedback. Are you using the dma_buf or host memory path? Are multiple kernels using the same imported buffer as part of a single flush? What DDK version are you using?
Regards,
Kévin
Hi Kévin,
I am not using dma_buf, but host memory. Only one kernel uses the imported buffer. I don't know the DDK version yet, but I am using a commercial Huawei Mate 10 phone, which was released last October.
My usage case is as below.
char *buffer = malloc(1080 * 1920 * 4);
cl_mem mem = clImportMemoryARM(ctx, CL_MEM_READ_WRITE, NULL, buffer, 1080 * 1920 * 4, &error);
Profiling shows that the clImportMemoryARM call on its own takes 4 ms to complete.
Thanks,
-Shouwen
Sorry for the delay (and the disappointing performance).
You're doing nothing wrong and this is definitely not the kind of performance we're expecting.
You can get the DDK version using the device info queries (clGetDeviceInfo with CL_DEVICE_VERSION). This would be really useful information for us.
To better understand your use-case and how we can help, could you answer the following questions please?
- What is your primary driver for wanting to use user memory imports? Maybe we can help you find an alternative.
- Are you writing a third-party Android application that you're testing on a Mate 10?
- Do you have access to more detailed information about the phone's internals?
- Do you have access to the Mali driver source? If yes, I would encourage you to raise a support case with ARM.
If there's anything more we can do to help, please let us know.
Hi Kevin,
I have raised a support case for the above issue, with more details in answer to your questions.
Vijay
Hi Vijay,
Thanks for the details.
Looking again at the code you've shared, I think I understand why you're finding the import call slow.
Linux over-commits memory, which means that when you call malloc, the Linux kernel (via the C library) just reserves a range of virtual addresses that aren't yet backed by physical memory pages. Each physical page is allocated lazily by the kernel, the first time an address within that page is accessed.
clImportMemoryARM requires that all the backing pages have been allocated for the import to complete (so that there is no need to interrupt GPU work to allocate pages later on).
Since you import the memory straight after the allocation, it means clImportMemoryARM will have to allocate and initialise (i.e. zero for security reasons) physical pages for the entirety of the allocation, which is where most of the time is spent.
If you initialise the memory before the import (writing a single byte in each page, i.e. every 4kB, should be enough), you'll find that the import call takes a lot less time.
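As a rough illustration of the pre-initialisation step described above, here is a minimal sketch in C. The helper names touch_pages and alloc_prefaulted are hypothetical (they are not part of OpenCL or the Mali driver), and the sketch assumes a POSIX system where sysconf(_SC_PAGESIZE) reports the page size. Because malloc does not guarantee page alignment, it also writes the last byte of the buffer so the final page is covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write one byte per page so that the kernel backs
 * the whole range with physical pages before the buffer is imported.
 * Overwrites the touched bytes with 0, so call it on a fresh buffer.
 * Returns the number of page-sized strides written. */
size_t touch_pages(void *ptr, size_t size)
{
    assert(ptr != NULL || size == 0);

    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096; /* assumed 4 kB fallback */
    volatile unsigned char *p = ptr;
    size_t touched = 0;

    for (size_t off = 0; off < size; off += page) {
        p[off] = 0;
        touched++;
    }
    /* The buffer may not start on a page boundary, so the last byte can
     * sit in a page the loop did not reach. */
    if (size > 0)
        p[size - 1] = 0;

    return touched;
}

/* Hypothetical helper: allocate a host buffer and pre-fault it. */
void *alloc_prefaulted(size_t size)
{
    void *ptr = malloc(size);
    if (ptr != NULL)
        touch_pages(ptr, size);
    return ptr;
}
```

With this, the earlier snippet would call alloc_prefaulted(1080 * 1920 * 4) instead of malloc, and then pass the returned pointer to clImportMemoryARM as before. The page-allocation cost does not disappear; it moves out of the import call and into the allocation step, where it can be paid once up front.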