Memory Optimization on Mali GPU

Hi everyone,

Recently I have been working on a GPU application that will run on an Arndale board and use its Mali GPU. To make the program run faster, I want to optimize its memory usage. According to the OpenCL guide, CL_MEM_ALLOC_HOST_PTR should be used to improve performance, and the use of CL_MEM_USE_HOST_PTR is discouraged.

But in my experiments, I found that using CL_MEM_USE_HOST_PTR actually reduces data transfer time but increases kernel execution overhead. I also found that a data copy is inevitable in both cases (CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR).

Can anyone confirm? Is it possible at all to have a zero copy?

The Mali OpenCL guide says that using CL_MEM_ALLOC_HOST_PTR requires no copy, but there is one. Say I have data at a pointer A, and I create a buffer using CL_MEM_ALLOC_HOST_PTR. To make the data in A available to the GPU, I have to memcpy it from A into the space allocated by CL_MEM_ALLOC_HOST_PTR.

So a data copy is still needed. Is there a way to access the data directly from the GPU without any copying?
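
One reading of the guide's "no copy" claim, sketched here under assumptions rather than confirmed: the memcpy is only needed when the data already lives at a separate pointer A. If the code that produces the data can be handed a destination pointer, it can write straight into the region returned by clEnqueueMapBuffer, and the staging array disappears. The function name produce_samples and its parameters are hypothetical; the OpenCL calls appear only in a comment and are not compiled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical producer: writes n samples straight into dst.
 * dst can be any writable float array. In the OpenCL case it would be the
 * pointer returned by clEnqueueMapBuffer for a CL_MEM_ALLOC_HOST_PTR buffer. */
static void produce_samples(float *dst, size_t n, float value)
{
    assert(dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = value;
    }
}

/*
 * Intended use with OpenCL (illustrative only, error handling omitted):
 *
 *   cl_float *p = (cl_float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
 *                     CL_MAP_WRITE, 0, bytes, 0, NULL, NULL, &err);
 *   produce_samples(p, ELE_NUM, 100.0f);   // no staging array, no memcpy
 *   clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
 */
```

This does not help when the data really must originate in memory that some other library allocated; in that case the copy into the mapped region remains.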

PS: I have attached my code for your feedback.


UPDATE: I have uploaded a version using CL_MEM_ALLOC_HOST_PTR for your review.


This is the code snippet:


    #ifdef mem_alloc_host
    start = getTime();

    /* Create the input buffers in driver-allocated memory and map them for writing. */
    a_st = getTime();
    bufferA = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
    cl_float *src_a = (cl_float *)clEnqueueMapBuffer(commandQueue, bufferA, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
    bufferB = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
    cl_float *src_b = (cl_float *)clEnqueueMapBuffer(commandQueue, bufferB, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
    clFinish(commandQueue);
    a_en = getTime();
    a_time += a_en - a_st;

    /* Fill the inputs directly through the mapped pointers. */
    pfill_s = getTime();
    for (int i = 0; i < ELE_NUM; i++) {
        src_a[i] = 100.0f;
        src_b[i] = 11.1f;
    }
    pfill_e = getTime();
    pfill_time += pfill_e - pfill_s;

    /* Unmap both buffers so the GPU can use them. */
    b_st = getTime();
    clEnqueueUnmapMemObject(commandQueue, bufferB, src_b, 0, NULL, NULL);
    clEnqueueUnmapMemObject(commandQueue, bufferA, src_a, 0, NULL, NULL);
    clFinish(commandQueue);
    b_en = getTime();
    b_time += b_en - b_st;

    end = getTime();
    creat_buffer += end - start;

    /* Output buffer, created outside the timed region. */
    bufferC = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
    #endif

6176.zip
  • The second reason it's slow is that some initialisation operations are deferred until the first time an object is actually used.

    Also, in a real-life application you would allocate your buffers once and then map/unmap them every frame, so for a realistic test case you should do something like:

    createBuffer();

    for (int i = 0; i < 100; i++)
    {
        timer_start();

        map();
        fill_buffer();
        unmap();
        enqueue_kernel();
        finish();

        timer_end();
    }

    releaseBuffer();

    When you do that, you should observe that the first iteration takes more time because of what I explained above, and that all the following iterations are much faster.
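
To make the difference between the first iteration and the rest visible in the numbers, the loop above could be wrapped in a small timing harness like the sketch below. It assumes a POSIX system with clock_gettime; the names frame_fn, frame_timing, time_frames and count_frame are hypothetical and not part of OpenCL or the attached code. The callback is meant to hold one full frame (map, fill, unmap, enqueue kernel, finish); count_frame is only a stand-in so the harness can be tried without a GPU.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict C modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: runs one complete frame. */
typedef void (*frame_fn)(void *user);

typedef struct {
    double first_ms;   /* iteration 0, which includes deferred initialisation */
    double steady_ms;  /* mean of iterations 1..n-1 */
} frame_timing;

/* Wall-clock time in milliseconds from a monotonic clock. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Runs run_frame n times and reports the first iteration separately
 * from the average of the remaining ones. */
static frame_timing time_frames(frame_fn run_frame, void *user, int n)
{
    assert(run_frame != NULL && n >= 2);
    frame_timing t = {0.0, 0.0};
    for (int i = 0; i < n; i++) {
        double start = now_ms();
        run_frame(user);
        double elapsed = now_ms() - start;
        if (i == 0) {
            t.first_ms = elapsed;
        } else {
            t.steady_ms += elapsed;
        }
    }
    t.steady_ms /= (double)(n - 1);
    return t;
}

/* Stand-in frame for trying the harness: just counts how often it ran. */
static void count_frame(void *user)
{
    int *calls = (int *)user;
    (*calls)++;
}
```

Because the callback is expected to end with a blocking finish, wall-clock time is measured rather than CPU time; a CPU-time clock would miss the time spent waiting for the GPU.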
