
clEnqueueMapBuffer takes a long time, giving no performance benefit from CL_MEM_ALLOC_HOST_PTR

Hello everyone,

According to the OpenCL guide for the Mali-600 series GPU, CL_MEM_ALLOC_HOST_PTR should be used to remove data copies and improve performance.

Today I measured memory copy times using CL_MEM_ALLOC_HOST_PTR on an Arndale board with a Mali-T604 GPU. I compared CL_MEM_ALLOC_HOST_PTR against clEnqueueWriteBuffer and found that overall I do not get much performance improvement from CL_MEM_ALLOC_HOST_PTR, because clEnqueueMapBuffer takes almost as long as clEnqueueWriteBuffer.

The test kernel is a vector addition.

How I tested:

Instead of creating a pointer with malloc and transferring the data to the device, I first created a buffer with CL_MEM_ALLOC_HOST_PTR. I then mapped this buffer with the OpenCL API, which returns a pointer, and filled the memory it points to with data. The mapping step in this case takes time: the mapping time is almost equal to the clEnqueueWriteBuffer time. So in this example I did not get any significant improvement from CL_MEM_ALLOC_HOST_PTR.
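For reference, a minimal sketch of the sequence described above (this assumes an existing cl_context `context`, a cl_command_queue `queue`, and an element count `N`; all error checking is omitted for brevity, so it is not a complete program):

```c
#include <CL/cl.h>

/* Sketch only: `context`, `queue`, and N are assumed to exist already. */
cl_int err;
size_t bytes = N * sizeof(cl_float);

/* Let the driver allocate host-accessible memory (no separate malloc). */
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            bytes, NULL, &err);

/* A blocking map returns a host pointer into the buffer's storage. */
cl_float *p = (cl_float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                             CL_MAP_WRITE, 0, bytes,
                                             0, NULL, NULL, &err);
for (size_t i = 0; i < N; ++i)
    p[i] = (cl_float)i;   /* fill the buffer in place, no extra copy */

/* Unmap before the kernel uses the buffer. */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```

The point of this pattern is that the fill loop writes directly into the buffer's own storage, so no clEnqueueWriteBuffer copy is needed afterwards.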

My question is: why does mapping take so long when I use CL_MEM_ALLOC_HOST_PTR?

Here are the performance measurements:

Element size: 10000000, kernel: vector addition, all times are in microseconds

Normal read/write buffer                                        Time
Kernel QUEUE -> SUBMIT                                          76
Kernel SUBMIT -> START                                          2985
Kernel START -> END                                             34549
Kernel QUEUE -> END                                             37611
clReleaseMemObject                                              61169
Buffer creation                                                 20
clEnqueueWriteBuffer                                            108019

CL_MEM_ALLOC_HOST_PTR, data written directly into the buffer    Time
Kernel QUEUE -> SUBMIT                                          80
Kernel SUBMIT -> START                                          2904
Kernel START -> END                                             34175
Kernel QUEUE -> END                                             37161
clReleaseMemObject                                              51675
Filling the pointer returned by clEnqueueMapBuffer with data    208009
Mapping                                                         81346
Unmapping                                                       269

CL_MEM_ALLOC_HOST_PTR, data copied from a malloc pointer        Time
Kernel QUEUE -> SUBMIT                                          88
Kernel SUBMIT -> START                                          3142
Kernel START -> END                                             33950
Kernel QUEUE -> END                                             37181
clReleaseMemObject                                              56681
Mapping                                                         64134
Unmapping                                                       190
memcpy (malloc pointer -> host-allocated pinned pointer)        56987

I have also attached the three versions of the vector-addition code to this post for your review.

6205.zip
  • As explained in the other thread.

    1) The reason it's slow is that some initialisation operations are deferred until the first time an object is actually used.

    In a real-life application you would allocate your buffers once, then map/unmap them every frame. So, to make a realistic test case, you should do something like:

    createBuffer();

    for (int i = 0; i < 100; i++)
    {
        timer_start();
        map();
        fill_buffer();
        unmap();
        enqueue_kernel();
        finish();
        timer_end();
    }

    releaseBuffer();

    When you do that, you should observe that the first iteration takes more time because of the deferred initialisation explained above; all the following iterations should be much faster.

    2) Most of the mapping time is actually spent doing cache maintenance, which depends heavily on the CPU frequency.

    So if the mapping operations take a long time, it is likely because the power policy on your device detected that the CPU was idle and, as a result, clocked down the CPU and memory-system frequencies.

    Therefore, to ensure maximum efficiency, try forcing your CPU power policy to performance:

    echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    Hope this helps,

    Anthony
