ARM_import_memory API is very slow

I want to use the clImportMemoryARM API to achieve zero-copy sharing between the CPU and the GPU.

However, the performance is not what I expected. For an FHD image, the import takes 4.4 ms, almost the same as uploading the data explicitly.

Is this slow performance expected? I am using a Mali G72 GPU.

Thanks,

-Shouwen

  • Hi Vijay,

    Thanks for the details.

    Looking again at the code you've shared, I think I understand why you're finding the import call slow.

    Linux over-commits memory, which means that when you call malloc, the Linux kernel (via the C library) only reserves a range of virtual addresses that is not yet backed by physical memory pages. The kernel allocates each physical page lazily, the first time a virtual address inside that page is accessed.

    clImportMemoryARM requires that all the backing pages have been allocated for the import to complete (so that there is no need to interrupt GPU work to allocate pages later on).

    Because you import the memory straight after allocating it, clImportMemoryARM has to allocate and initialise (i.e. zero, for security reasons) physical pages for the entire allocation, which is where most of the time is spent.

    If you initialise the memory before the import (writing a single byte in each page, i.e. every 4 kB, should be enough), you'll find that the import call takes a lot less time.

    Regards,

    Kévin
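
The page-touching step suggested above could be sketched roughly as follows. This is only an illustration: `prefault_pages` is a hypothetical helper name, not part of any OpenCL or Arm API, and the 4 kB fallback page size is an assumption for systems where `sysconf` fails. The helper writes zeros, so it is meant for a freshly allocated buffer that does not hold data yet. The `clImportMemoryARM` call itself is left out so the sketch builds without OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write one byte per page of [buf, buf + size) so the
 * kernel backs the whole range with physical pages before it is imported.
 * Returns the number of page-sized strides that were touched. */
static size_t prefault_pages(void *buf, size_t size)
{
    assert(buf != NULL);

    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096; /* assumed 4 kB fallback */
    volatile unsigned char *p = buf;          /* volatile: keep the writes */
    size_t touched = 0;

    for (size_t off = 0; off < size; off += page) {
        p[off] = 0;
        touched++;
    }

    /* malloc does not return page-aligned memory, so the last byte can sit
     * in a page the strided loop never reached. Touch it as well. */
    if (size > 0)
        p[size - 1] = 0;

    return touched;
}
```

A caller would allocate the host buffer, run `prefault_pages(buf, size)` once, and only then pass `buf` and `size` to `clImportMemoryARM`. The cost of faulting in the pages does not disappear; it moves out of the import call to wherever the buffer is first touched, for example at start-up.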
