
clEnqueueMapBuffer takes a long time, giving no performance benefit from using CL_MEM_ALLOC_HOST_PTR

Hello everyone,

According to the OpenCL guide for the Mali 600 GPU, CL_MEM_ALLOC_HOST_PTR should be used to avoid data copies and improve performance.

Today I was measuring memory copy time with CL_MEM_ALLOC_HOST_PTR on an Arndale board, which has a Mali 604 GPU. I tested both with CL_MEM_ALLOC_HOST_PTR and with clEnqueueWriteBuffer. Overall, I do not get much performance improvement from CL_MEM_ALLOC_HOST_PTR, because the clEnqueueMapBuffer call takes almost the same time as clEnqueueWriteBuffer.

This test has been done on vector addition.

How I tested:

Instead of creating a pointer with malloc and transferring the data to the device, I first created a buffer with CL_MEM_ALLOC_HOST_PTR. I then mapped this buffer using the OpenCL API, which returns a pointer, and filled the memory it points to with data. The mapping step itself takes time: it is almost equal to the clEnqueueWriteBuffer time. So, in this example, I did not get any significant improvement from using CL_MEM_ALLOC_HOST_PTR.

My question is: why is the mapping time so large when I use CL_MEM_ALLOC_HOST_PTR?

Here are the performance measurements:

Number of elements: 10000000, Kernel: vector addition. All times are in microseconds.

Normal read/write buffer (time):
Kernel QUEUE -> SUBMIT: 76
Kernel SUBMIT -> START: 2985
Kernel START -> END: 34549
Kernel QUEUE -> END: 37611
Buffer creation time: 20
Enqueue write buffer time: 108019

CL_MEM_ALLOC_HOST_PTR, with data written directly into the allocated buffer (time):
Kernel QUEUE -> SUBMIT: 80
Kernel SUBMIT -> START: 2904
Kernel START -> END: 34175
Kernel QUEUE -> END: 37161
Filling the pointer returned by clEnqueueMapBuffer with data: 208009
Mapping time: 81346
Unmapping time: 269

CL_MEM_ALLOC_HOST_PTR, with data copied from a malloc pointer to the host-allocated pointer using memcpy (time):
Kernel QUEUE -> SUBMIT: 88
Kernel SUBMIT -> START: 3142
Kernel START -> END: 33950
Kernel QUEUE -> END: 37181
Mapping time: 64134
Unmapping time: 190
memcpy time (copy data from the already created malloc pointer to the host-allocated pinned pointer):
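
To put the mapping times in perspective, here is a small sketch in C that expresses each of them as a percentage of the clEnqueueWriteBuffer time. The three timings are copied from the measurements above; the helper names percent_of and print_map_vs_write are made up for illustration and are not part of the attached code.

```c
#include <stdio.h>

/* Timings in microseconds, copied from the measurements in this post. */
static const double WRITE_BUFFER_US = 108019.0; /* enqueue write buffer time */
static const double MAP_DIRECT_US   = 81346.0;  /* mapping time, data written directly */
static const double MAP_MEMCPY_US   = 64134.0;  /* mapping time, data copied with memcpy */

/* Hypothetical helper: express one timing as a percentage of another. */
static double percent_of(double part_us, double whole_us)
{
    return 100.0 * part_us / whole_us;
}

/* Hypothetical helper: print both mapping times relative to the write time. */
static void print_map_vs_write(void)
{
    printf("map (direct fill) vs write: %.1f%%\n",
           percent_of(MAP_DIRECT_US, WRITE_BUFFER_US));
    printf("map (memcpy fill) vs write: %.1f%%\n",
           percent_of(MAP_MEMCPY_US, WRITE_BUFFER_US));
}
```

With these numbers the mapping step costs roughly 75% and 59% of the write-buffer time, which is why the map-based versions show no large overall gain. The kernel QUEUE -> END times (37611, 37161 and 37181) are nearly identical in all three runs, so the difference lies entirely in the transfer path.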


I have also attached the three versions of the vector addition code to this post for your review.