
clEnqueueMapBuffer takes a long time, giving no performance benefit from CL_MEM_ALLOC_HOST_PTR

Hello everyone,

According to the OpenCL guide for the Mali-600 series GPU, CL_MEM_ALLOC_HOST_PTR should be used to remove data copies and improve performance.

Today I measured memory copy times using CL_MEM_ALLOC_HOST_PTR on an Arndale board with a Mali-T604 GPU. I compared CL_MEM_ALLOC_HOST_PTR against clEnqueueWriteBuffer and found that overall I do not get much performance improvement from CL_MEM_ALLOC_HOST_PTR, because clEnqueueMapBuffer takes almost as long as clEnqueueWriteBuffer.

The test kernel is a vector addition.

How I tested:

Instead of creating a pointer with malloc and transferring the data to the device, I first created a buffer with CL_MEM_ALLOC_HOST_PTR. I then mapped this buffer with the OpenCL API, which returns a pointer, and filled the memory it points to with data. The mapping step in this case takes time: the mapping time is almost equal to the clEnqueueWriteBuffer time. So in this example I did not get any significant improvement from CL_MEM_ALLOC_HOST_PTR.
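For reference, a minimal sketch of the sequence described above (this assumes an existing cl_context `context`, a cl_command_queue `queue`, and an element count `N`; all error checking is omitted for brevity, so it is not a complete program):

```c
#include <CL/cl.h>

/* Sketch only: `context`, `queue`, and N are assumed to exist already. */
cl_int err;
size_t bytes = N * sizeof(cl_float);

/* Let the driver allocate host-accessible memory (no separate malloc). */
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            bytes, NULL, &err);

/* A blocking map returns a host pointer into the buffer's storage. */
cl_float *p = (cl_float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                             CL_MAP_WRITE, 0, bytes,
                                             0, NULL, NULL, &err);
for (size_t i = 0; i < N; ++i)
    p[i] = (cl_float)i;   /* fill the buffer in place, no extra copy */

/* Unmap before the kernel uses the buffer. */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```

The point of this pattern is that the fill loop writes directly into the buffer's own storage, so no clEnqueueWriteBuffer copy is needed afterwards.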

My question is: why does mapping take so long when I use CL_MEM_ALLOC_HOST_PTR?

Here are the performance measurements:

Element size: 10000000, kernel: vector addition, all times are in microseconds

Normal read/write buffer                                        Time
Kernel QUEUE -> SUBMIT                                          76
Kernel SUBMIT -> START                                          2985
Kernel START -> END                                             34549
Kernel QUEUE -> END                                             37611
clReleaseMemObject                                              61169
Buffer creation                                                 20
clEnqueueWriteBuffer                                            108019

CL_MEM_ALLOC_HOST_PTR, data written directly into the buffer    Time
Kernel QUEUE -> SUBMIT                                          80
Kernel SUBMIT -> START                                          2904
Kernel START -> END                                             34175
Kernel QUEUE -> END                                             37161
clReleaseMemObject                                              51675
Filling the pointer returned by clEnqueueMapBuffer with data    208009
Mapping                                                         81346
Unmapping                                                       269

CL_MEM_ALLOC_HOST_PTR, data copied from a malloc pointer        Time
Kernel QUEUE -> SUBMIT                                          88
Kernel SUBMIT -> START                                          3142
Kernel START -> END                                             33950
Kernel QUEUE -> END                                             37181
clReleaseMemObject                                              56681
Mapping                                                         64134
Unmapping                                                       190
memcpy (malloc pointer -> host-allocated pinned pointer)        56987

I have also attached the three versions of the vector-addition code to this post for your review.

6205.zip
  • As explained in the other thread.

    1) The reason it's slow is that some initialisation operations are deferred until the first time an object is actually used.

    In a real-life application you would allocate your buffers once, then map/unmap them every frame. So, to make a realistic test case, you should do something like:

    createBuffer();

    for (int i = 0; i < 100; i++)
    {
        timer_start();
        map();
        fill_buffer();
        unmap();
        enqueue_kernel();
        finish();
        timer_end();
    }

    releaseBuffer();

    When you do that, you should observe that the first iteration takes more time because of the deferred initialisation explained above; all the following iterations should be much faster.

    2) Most of the mapping time is actually spent doing cache maintenance, which depends heavily on the CPU frequency.

    So if the mapping operations take a long time, it is likely because the power policy on your device detected that the CPU was idle and, as a result, clocked down the CPU and memory-system frequencies.

    Therefore, to ensure maximum efficiency, try forcing your CPU power policy to performance:

    echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    Hope this helps,

    Anthony
