Memory Optimization on Mali GPU

Hi everyone,

I have recently been working on a GPU application that will run on an Arndale board and use its Mali GPU. To make the program run faster, I want to optimize its memory usage. According to the OpenCL guide, CL_MEM_ALLOC_HOST_PTR should be used to improve performance, and CL_MEM_USE_HOST_PTR is discouraged.

But in my experiments, I found that CL_MEM_USE_HOST_PTR actually reduces data transfer time while increasing kernel execution overhead. I also found that a data copy is unavoidable in both cases (CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR).

Can anyone confirm? Is it possible at all to have a zero copy?

The Mali OpenCL guide says that using CL_MEM_ALLOC_HOST_PTR requires no copy. But there is a copy. Let's say I have a pointer A and I create a buffer using CL_MEM_ALLOC_HOST_PTR. To make the data at A available to the GPU, I have to memcpy it from A into the space allocated through CL_MEM_ALLOC_HOST_PTR.

So, data copy is needed. Is there a way to access the data directly from GPU without any copying?

PS: I have attached my code for your feedback.


UPDATE: I have uploaded a version that uses CL_MEM_ALLOC_HOST_PTR for your review.


This is the code snippet:


#ifdef mem_alloc_host
   start = getTime();

   /* Create buffers A and B in driver-allocated memory and map them for writing. */
   a_st = getTime();
   bufferA = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
   cl_float* src_a = (cl_float*)clEnqueueMapBuffer(commandQueue, bufferA, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
   bufferB = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
   cl_float* src_b = (cl_float*)clEnqueueMapBuffer(commandQueue, bufferB, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
   clFinish(commandQueue);
   a_en = getTime();
   a_time = a_time + (a_en - a_st);

   /* Fill the inputs directly through the mapped pointers (no intermediate host array). */
   pfill_s = getTime();
   for (int i = 0; i < ELE_NUM; i++) {
      src_a[i] = 100.0;
      src_b[i] = 11.1;
   }
   pfill_e = getTime();
   pfill_time = pfill_time + (pfill_e - pfill_s);

   /* Unmap both buffers so the GPU can use them. */
   b_st = getTime();
   clEnqueueUnmapMemObject(commandQueue, bufferB, src_b, 0, NULL, NULL);
   clEnqueueUnmapMemObject(commandQueue, bufferA, src_a, 0, NULL, NULL);
   clFinish(commandQueue);
   b_en = getTime();
   b_time = b_time + (b_en - b_st);

   end = getTime();
   creat_buffer += (end - start);

   /* The output buffer is created outside the timed region. */
   bufferC = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
#endif

6176.zip
  • > Is there a way to access the data directly from GPU without any copying?


    Depending on the use case, you may be able to populate the data directly into the CL_MEM_ALLOC_HOST_PTR buffer. For example, if your data is being decompressed, you could pass the pointer for the allocated buffer into your decompression library so that the decompressed output is streamed directly into the CL-visible memory.

    If the data already sits in a buffer in your application, so you can't do the above, then you have to copy either in the application or in the driver; the choice of API just means you can avoid copying twice.


    HTH,

    Pete


  • Hi,

    Today I was testing memory copy time using CL_MEM_ALLOC_HOST_PTR, with vector addition as the test case. I found that with CL_MEM_ALLOC_HOST_PTR the buffer mapping time becomes large, close to the time taken by clEnqueueWriteBuffer.

    How I tested:

    I tried to use the first scenario you mentioned.

    Instead of creating a pointer with malloc and transferring the data to the device, I first created a buffer using CL_MEM_ALLOC_HOST_PTR. Then I mapped this buffer using the OpenCL API. This returns a pointer, and I filled the memory it points to with data. The mapping step takes time in this case: it is almost equal to the time taken by clEnqueueWriteBuffer. So, in this example, I did not get any significant improvement from using CL_MEM_ALLOC_HOST_PTR.

    Could you please let me know if I am doing anything wrong, or whether this is the correct way to use CL_MEM_ALLOC_HOST_PTR?

  • Hi,

    Most of the mapping time is actually spent doing cache maintenance which heavily depends on the CPU frequency.

    So if the mapping operations take a lot of time it's likely due to the power policy on your device having detected the CPU was idle and as a result clocked down the CPU and the memory system frequencies.

    Therefore, to ensure maximum efficiency try to force your CPU power policy to performance:

    echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
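Before timing anything, it can be worth confirming that the governor change actually took effect. The following is a minimal sketch, assuming the standard Linux cpufreq sysfs layout; the helper names `read_governor` and `governor_is_performance` are hypothetical and not part of any library. Note that the echo above only writes the cpu0 entry, so on some systems other cores or clusters may need to be checked separately.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the active governor name from a
 * scaling_governor-style file into `out`.
 * Returns 0 on success, -1 if the file cannot be read. */
int read_governor(const char *path, char *out, size_t out_size)
{
    assert(path != NULL && out != NULL && out_size > 0);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(out, (int)out_size, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    out[strcspn(out, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/* Hypothetical helper: 1 if the governor is "performance",
 * 0 if it is something else, -1 if the file is unreadable. */
int governor_is_performance(const char *path)
{
    char name[64];
    if (read_governor(path, name, sizeof name) != 0)
        return -1;
    return strcmp(name, "performance") == 0;
}
```

A typical call would be `governor_is_performance("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")`, repeated for each CPU index you care about.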

    Hope this helps,

    Anthony

  • The second reason it's slow is that some initialisation operations are deferred until the first time an object is actually used.

    Also, in a real-life application you would allocate your buffers once and then map/unmap them every frame, so if you want a realistic test case you should do something like

    createBuffer();

    for (int i = 0; i < 100; i++)
    {
        timer_start();
        map();
        fill_buffer();
        unmap();
        enqueue_kernel();
        finish();
        timer_end();
    }

    releaseBuffer();

    When doing that you should observe that the first iteration will take more time because of what I explained above, then all the following iterations should be much faster.
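One way to make the first-iteration effect visible is to record each iteration's time and report the first sample separately from the average of the rest. This is only a sketch under stated assumptions: `iteration_fn` is a hypothetical callback standing in for one map / fill / unmap / enqueue / finish cycle, and `time_iterations` and `steady_state_mean` are illustrative names, not an existing API. It assumes a POSIX system for `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: one full map/fill/unmap/enqueue/finish cycle. */
typedef void (*iteration_fn)(void *user);

/* Wall-clock time in milliseconds from a monotonic clock. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* Run `fn` n times and store each iteration's duration in samples_ms. */
void time_iterations(iteration_fn fn, void *user, double *samples_ms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t0 = now_ms();
        fn(user);
        samples_ms[i] = now_ms() - t0;
    }
}

/* Mean of all samples except the first, which is treated as warm-up. */
double steady_state_mean(const double *samples_ms, size_t n)
{
    assert(samples_ms != NULL && n > 1);

    double sum = 0.0;
    for (size_t i = 1; i < n; i++)
        sum += samples_ms[i];
    return sum / (double)(n - 1);
}
```

Comparing `samples_ms[0]` with `steady_state_mean(samples_ms, n)` then shows how much of the measured cost is one-time setup rather than per-frame mapping.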

  • Dear mainul,

    As there's been no activity on this post for a couple of months I'm going to mark the question as assumed answered. Let us know if you want to open it back up.

    Thanks,

    Ellie

  • Hi Anthony,

    I have a similar problem, but related to CPU load. I set scaling_governor to performance and ran the algorithm while continuously checking cpu_currFreq. I expected it to stay constant at max_frequency, but I am observing variations between min_freq and max_freq.

    How can I disable DVFS completely (we have the kernel source) or make the CPU always run at the maximum clock?

    Thanks,

    Veeranna

  • Hi Veeranna,

    If you're indeed using the performance governor and the frequency drops then it means you reached the thermal threshold of the device.

    If it's a dev board you can try to add a heat sink on the chip.
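To catch such frequency drops while a benchmark is running, the current frequency can be polled and compared against the maximum. Below is a minimal sketch, assuming the usual cpufreq sysfs files (for example `scaling_cur_freq` and `cpuinfo_max_freq`, both reported in kHz); `read_khz` and `below_max_freq` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one integer frequency (kHz) from a
 * sysfs-style file. Returns -1 if the file is missing or malformed. */
long read_khz(const char *path)
{
    assert(path != NULL);

    long khz = -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &khz) != 1)
        khz = -1;
    fclose(f);
    return khz;
}

/* Hypothetical helper: 1 if the current frequency is below the maximum,
 * 0 if it is at (or above) the maximum, -1 if either file is unreadable. */
int below_max_freq(const char *cur_path, const char *max_path)
{
    long cur = read_khz(cur_path);
    long max = read_khz(max_path);
    if (cur < 0 || max <= 0)
        return -1;
    return cur < max;
}
```

Calling `below_max_freq` once per benchmark iteration and logging the result alongside the timing makes it easier to tell throttled runs from unthrottled ones.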

    Thanks,

    Anthony

  • Hi Anthony,

    Yes, it is an Exynos5420 development board from Insignal, and we have a heat sink on the chip. Does that mean the heat sink is not capable enough?

    More ideas will be helpful.

    Thanks,

    Veeranna

  • Hi veerannah,


    It is worth remembering that the Arndale is a development board and is designed to be pushed to its limits. What you do on this hardware, may not work well on production devices due to different thermal and power constraints for each device.

    With that in mind, you can do things such as disable DVFS or clock the CPU/GPU etc to the highest supported frequencies. Since it is a non-battery powered device, power constraints are relaxed, but you still have thermal limitations to deal with... this can even be stricter than a production device since the SoC is exposed on a devboard and not helped by the form factor of a production device.

    To protect itself, even when DVFS has been disabled, the SoC will start underclocking itself once it reaches a thermal limit that is deemed unsafe for normal operation, in order to deal with the excess heat.

    If what you are interested in is stress testing the theoretical limits of the hardware, you can help this along by using a heatsink, a fan, or even going to extremes with liquid cooling.

    Obviously, if you are interested in real world performance, this is not really the thing you should be pursuing, but rather you should be looking at ways you could optimise your code to reduce power consumption, which in turn will decrease the thermal issue.

    An example is bandwidth... by reducing bandwidth, you not only save power consumption, but also a lot on heat as well...

    If you have any further questions, please do let us know.

    Kind Regards,

    Michael McGeagh

  • Hi Michael,

    Thanks for the detailed reply.