
Memory Optimization on Mali GPU

Hi everyone,

Recently I have been working on a GPU application that will run on an Arndale board with a Mali GPU. To make execution faster I want to optimize memory usage. According to the Mali OpenCL guide, CL_MEM_ALLOC_HOST_PTR should be used to improve performance, while CL_MEM_USE_HOST_PTR is discouraged.

However, in my experiments I found that CL_MEM_USE_HOST_PTR actually reduces data transfer time but increases kernel execution overhead. I also found that a data copy is inevitable in both cases (CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR).
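
For reference, the CL_MEM_USE_HOST_PTR path I compared against looks roughly like this (a simplified sketch, error checking omitted; context and ELE_NUM are set up elsewhere, as in the attached code):


   cl_float* host_a = (cl_float*)malloc(sizeof(cl_float) * ELE_NUM);
   for (int i = 0; i < ELE_NUM; i++)
       host_a[i] = 100.0;                 /* data filled on the host side */

   /* The buffer wraps the existing host allocation; the driver may still
      copy/shadow it internally, which is why the guide discourages
      CL_MEM_USE_HOST_PTR. */
   cl_mem bufA = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                                sizeof(cl_float) * ELE_NUM, host_a, &err);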

Can anyone confirm this? Is it possible to achieve zero copy at all?

The Mali OpenCL guide says that using CL_MEM_ALLOC_HOST_PTR requires no copy, but in practice there is one. Say I have a pointer A and I create a buffer with CL_MEM_ALLOC_HOST_PTR. To make the data in A available to the GPU, I still have to memcpy it from A into the region I get by mapping the CL_MEM_ALLOC_HOST_PTR buffer.

So a data copy is still needed. Is there a way for the GPU to access the data directly, without any copying?
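
To make the copy I mean concrete, here is a minimal sketch (error checking omitted; A is the pointer my application already owns, holding ELE_NUM floats):


   bufferA = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                            sizeof(cl_float) * ELE_NUM, NULL, &err);
   cl_float* mapped = (cl_float*)clEnqueueMapBuffer(commandQueue, bufferA, CL_TRUE,
                          CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM,
                          0, NULL, NULL, &err);
   memcpy(mapped, A, sizeof(cl_float) * ELE_NUM);   /* <-- the copy I want to avoid */
   clEnqueueUnmapMemObject(commandQueue, bufferA, mapped, 0, NULL, NULL);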

PS: I have attached my code for your feedback.


UPDATE: I have uploaded a version using CL_MEM_ALLOC_HOST_PTR for your review.


This is the code snippet:


   #ifdef mem_alloc_host
   start = getTime();

   /* Create two CL_MEM_ALLOC_HOST_PTR buffers and map them so the host can
      write directly into CL-visible memory. */
   a_st = getTime();
   bufferA = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
   cl_float* src_a = (cl_float*)clEnqueueMapBuffer(commandQueue, bufferA, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
   bufferB = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
   cl_float* src_b = (cl_float*)clEnqueueMapBuffer(commandQueue, bufferB, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
   clFinish(commandQueue);
   a_en = getTime();
   a_time = a_time + (a_en - a_st);

   /* Fill the mapped regions from the host (no separate application copy). */
   pfill_s = getTime();
   for (int i = 0; i < ELE_NUM; i++) {
       src_a[i] = 100.0;
       src_b[i] = 11.1;
   }
   pfill_e = getTime();
   pfill_time = pfill_time + (pfill_e - pfill_s);

   /* Unmap before the buffers are used by the kernel. */
   b_st = getTime();
   clEnqueueUnmapMemObject(commandQueue, bufferB, src_b, 0, NULL, NULL);
   clEnqueueUnmapMemObject(commandQueue, bufferA, src_a, 0, NULL, NULL);
   clFinish(commandQueue);
   b_en = getTime();
   b_time = b_time + (b_en - b_st);

   end = getTime();
   creat_buffer += (end - start);

   /* Output buffer for the kernel result. */
   bufferC = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
   #endif

6176.zip
  • > Is there a way for the GPU to access the data directly, without any copying?


    Depending on your use case you may be able to populate the CL_MEM_ALLOC_HOST_PTR buffer directly - e.g. if your data is being decompressed, you could pass the mapped pointer for the allocated buffer into your decompression library so that the decompressed output is streamed directly into the CL-visible memory.
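
    For illustration, a minimal sketch of that pattern (decompress_into() is a placeholder for whatever routine produces your data; context, commandQueue and ELE_NUM as in the snippet above):


       cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                                   sizeof(cl_float) * ELE_NUM, NULL, &err);
       cl_float* dst = (cl_float*)clEnqueueMapBuffer(commandQueue, buf, CL_TRUE,
                           CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM,
                           0, NULL, NULL, &err);

       decompress_into(dst, ELE_NUM);   /* output lands directly in CL-visible memory */

       clEnqueueUnmapMemObject(commandQueue, buf, dst, 0, NULL, NULL);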

     

    If you already have a buffer in your application and so can't do the above, then you either have to copy in the application or copy in the drivers - the choice of API just means you can avoid copying twice.


    HTH,

    Pete

