Hello everyone,
According to the OpenCL guide for Mali-T600 series GPUs, CL_MEM_ALLOC_HOST_PTR should be used to avoid data copies and improve performance.
Today I measured memory copy time on an Arndale board, which has a Mali-T604 GPU. I tested with CL_MEM_ALLOC_HOST_PTR and with clEnqueueWriteBuffer. Overall, I do not get much performance improvement from CL_MEM_ALLOC_HOST_PTR, because the clEnqueueMapBuffer call takes almost the same time as clEnqueueWriteBuffer.
The test kernel is vector addition.
How I tested:
Instead of allocating memory with malloc and transferring the data to the device, I first created a buffer with CL_MEM_ALLOC_HOST_PTR. Then I mapped this buffer with clEnqueueMapBuffer, which returns a pointer, and filled the memory behind that pointer with data. The mapping step itself takes time: it is almost equal to the time taken by clEnqueueWriteBuffer. So, in this example, I did not get any significant improvement from CL_MEM_ALLOC_HOST_PTR.
My question is: why is the mapping time so long when I use CL_MEM_ALLOC_HOST_PTR?
Here are the performance measurements:
Number of elements: 10000000, kernel: vector addition, all times in microseconds
56987
I have also attached the three versions of the vector addition code to this post for your review.