Memory Optimization on Mali GPU

Hi everyone,

Recently I have been working on a GPU application that will run on an Arndale board and use the Mali GPU. To make the program run faster, I wanted to optimize memory usage. According to the OpenCL guide, CL_MEM_ALLOC_HOST_PTR should be used to improve performance, and the use of CL_MEM_USE_HOST_PTR is discouraged.

But in my experiments, I found that using CL_MEM_USE_HOST_PTR actually reduces data transfer time but increases kernel execution overhead. I also found that a data copy is inevitable in both cases (CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR).

Can anyone confirm? Is it possible at all to have a zero copy?

The Mali OpenCL guide says that using CL_MEM_ALLOC_HOST_PTR requires no copy. But there is a copy. Let's say I have a pointer A and I create a buffer using CL_MEM_ALLOC_HOST_PTR. To make the data of A available to the GPU, I have to memcpy it from A into the allocated space I get from CL_MEM_ALLOC_HOST_PTR.

So a data copy is still needed. Is there a way to access the data directly from the GPU without any copying?

PS: I have attached my code for your feedback.


UPDATE: I have uploaded a version that uses CL_MEM_ALLOC_HOST_PTR for your review.


This is the code snippet:


   #ifdef mem_alloc_host
   /* NOTE: err is overwritten by each call and never checked in this snippet. */

   /* Phase A: create the buffers and map them into host address space. */
   start = getTime();
   a_st = getTime();
   bufferA = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
   cl_float *src_a = (cl_float *)clEnqueueMapBuffer(commandQueue, bufferA, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
   bufferB = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
   cl_float *src_b = (cl_float *)clEnqueueMapBuffer(commandQueue, bufferB, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
   clFinish(commandQueue);   /* both maps are blocking (CL_TRUE), so there should be nothing left to wait for */
   a_en = getTime();
   a_time = a_time + (a_en - a_st);

   /* Phase B: fill the mapped memory directly from the host. */
   pfill_s = getTime();
   for (int i = 0; i < ELE_NUM; i++) {
       src_a[i] = 100.0f;
       src_b[i] = 11.1f;
   }
   pfill_e = getTime();
   pfill_time = pfill_time + (pfill_e - pfill_s);

   /* Phase C: unmap so the GPU can use the buffers, and wait for completion. */
   b_st = getTime();
   clEnqueueUnmapMemObject(commandQueue, bufferB, src_b, 0, NULL, NULL);
   clEnqueueUnmapMemObject(commandQueue, bufferA, src_a, 0, NULL, NULL);
   clFinish(commandQueue);
   b_en = getTime();
   b_time = b_time + (b_en - b_st);
   end = getTime();
   creat_buffer += (end - start);

   /* The output buffer is created outside the timed region. */
   bufferC = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);

   #endif
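
The snippet calls getTime() but does not show it. As a minimal sketch only, a timer of that name could be built on the POSIX monotonic clock as below. The return unit (milliseconds, as a double) is an assumption, not something taken from the attached code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime() under strict C modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the getTime() used in the snippet above.
 * Returns monotonic time in milliseconds, so differences between two
 * calls are not affected by wall-clock adjustments. */
double getTime(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);   /* CLOCK_MONOTONIC is expected to be available on Linux */
    (void)rc;          /* keeps the build warning-free when NDEBUG is defined */
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}
```

With a timer like this, a_time, pfill_time and b_time accumulate the map, fill and unmap phases separately, which makes it easier to see where the time goes.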

6176.zip
  • Hi,

    Most of the mapping time is actually spent doing cache maintenance, which depends heavily on the CPU frequency.

    So if the mapping operations take a lot of time it's likely due to the power policy on your device having detected the CPU was idle and as a result clocked down the CPU and the memory system frequencies.

    Therefore, to ensure maximum efficiency try to force your CPU power policy to performance:

    echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    Hope this helps,

    Anthony
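
The echo command above sets the governor through cpu0 only. On SoCs where cores or clusters have separate cpufreq policies, the other CPUs may need the same setting. The following is a rough sketch, not a tested recipe for the Arndale board: the function names are hypothetical, the path is split into a prefix and suffix so it can be pointed at a different location, and writing to sysfs normally requires root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write `governor` to <prefix><cpu><suffix> for cpu = 0..max_cpus-1.
 * Returns the number of files written successfully. CPUs whose file
 * cannot be opened (absent, offline, or no permission) are skipped. */
int set_governor_all(const char *prefix, const char *suffix,
                     const char *governor, int max_cpus)
{
    assert(prefix && suffix && governor);
    int updated = 0;
    for (int cpu = 0; cpu < max_cpus; cpu++) {
        char path[256];
        int n = snprintf(path, sizeof path, "%s%d%s", prefix, cpu, suffix);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;                      /* path did not fit in the buffer */
        FILE *f = fopen(path, "w");
        if (!f)
            continue;
        int ok = fputs(governor, f) >= 0;
        if (fclose(f) != 0)                /* write errors can surface on flush */
            ok = 0;
        updated += ok;
    }
    return updated;
}

/* Return 1 if the governor file for `cpu` currently reads `governor`. */
int governor_is(const char *prefix, const char *suffix,
                int cpu, const char *governor)
{
    assert(prefix && suffix && governor);
    char path[256], line[64];
    int n = snprintf(path, sizeof path, "%s%d%s", prefix, cpu, suffix);
    if (n < 0 || (size_t)n >= sizeof path)
        return 0;
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;
    char *got = fgets(line, sizeof line, f);
    fclose(f);
    if (!got)
        return 0;
    line[strcspn(line, "\n")] = '\0';      /* strip the trailing newline */
    return strcmp(line, governor) == 0;
}
```

On a typical Linux system the call would look like set_governor_all("/sys/devices/system/cpu/cpu", "/cpufreq/scaling_governor", "performance", 8), followed by governor_is() on each CPU to confirm the setting took effect. The CPU count of 8 is only an example upper bound.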

  • Hi Anthony,

    I have a similar problem, but related to CPU load. I set scaling_governor to performance and ran the algorithm, while continuously checking cpu_currFreq. I expected it to stay constant at max_frequency, but I am observing variations between min_freq and max_freq.

    How can I disable DVFS completely (we have the kernel source) or make the CPU always run at the maximum clock?

    Thanks,

    Veeranna

  • Hi Veeranna,

    If you're indeed using the performance governor and the frequency still drops, then it means you have reached the thermal threshold of the device.

    If it's a dev board you can try to add a heat sink on the chip.

    Thanks,

    Anthony
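
One way to check whether the frequency drops line up with thermal throttling is to log the CPU frequency and the SoC temperature side by side. The sketch below assumes the standard Linux sysfs files scaling_cur_freq (in kHz) and thermal_zone0/temp (in millidegrees Celsius); whether the board's kernel exposes them at these paths is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a sysfs-style text file.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed. */
int read_sysfs_long(const char *path, long *out)
{
    assert(path && out);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%ld", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Print one sample: CPU frequency in MHz and temperature in degrees C.
 * Returns 0 on success, -1 if either value could not be read. */
int print_sample(const char *freq_path, const char *temp_path)
{
    long khz, mdeg;
    if (read_sysfs_long(freq_path, &khz) != 0)
        return -1;
    if (read_sysfs_long(temp_path, &mdeg) != 0)
        return -1;
    printf("freq=%ld MHz temp=%.1f C\n", khz / 1000, mdeg / 1000.0);
    return 0;
}
```

Calling print_sample("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "/sys/class/thermal/thermal_zone0/temp") once per second while the algorithm runs gives a simple trace. If the frequency falls as the temperature climbs, that is consistent with the thermal limit described above.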

  • Hi Anthony,

    Yes, it is an Exynos5420 development board from Insignal, and we have a heat sink on the chip. Does that mean the heat sink is not capable enough?

    More ideas will be helpful.

    Thanks,

    Veeranna

  • Hi veerannah,


    It is worth remembering that the Arndale is a development board and is designed to be pushed to its limits. What you do on this hardware may not work well on production devices, because each device has different thermal and power constraints.

    With that in mind, you can do things such as disabling DVFS or clocking the CPU, GPU, etc. to the highest supported frequencies. Since it is not a battery-powered device, power constraints are relaxed, but you still have thermal limitations to deal with. These can be even stricter than on a production device, since the SoC is exposed on a devboard and is not helped by the form factor of a production device.

    To protect itself, even when DVFS has been disabled, the SoC will start underclocking itself once it reaches a thermal limit that is deemed unsafe for normal operation, to try to deal with the excess heat.

    If what you are interested in is stress testing the theoretical limits of the hardware, you can try to help this along by using a heatsink, a fan, or even going to extremes with liquid cooling.

    Obviously, if you are interested in real-world performance, this is not really what you should be pursuing. Instead, you should look at ways to optimise your code to reduce power consumption, which in turn will reduce the thermal issue.

    An example is bandwidth: by reducing bandwidth, you not only reduce power consumption but also cut a lot of heat.

    If you have any further questions, please do let us know.

    Kind Regards,

    Michael McGeagh

  • Hi Michael,

    Thanks for the detailed reply.