Buffer creation taking 10 ms on Mali-G72

Hi,

I am working on a video processing application in which I have to pass a source image to the GPU, run a computation on it, and write the result to a destination. I read that creating buffers inside the loop on every iteration adds GPU overhead, so I implemented the following. However, I am still facing the same performance issue. Can someone help me?

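// Create the source and destination buffers once, outside the loop
buffer_ptr = clCreateBuffer(context_, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, mem_size, NULL, &error_code);
buffer_dst_ptr = clCreateBuffer(context_, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, mem_size, NULL, &error_code);

while (recording) {

    // Blocking copy of the source frame into the source buffer
    clEnqueueWriteBuffer(queue_, buffer_ptr, CL_TRUE, 0, mem_size, src_ptr, 0, NULL, NULL);

    // One work-item per destination pixel
    size_t global_size[2] = {dst_w, dst_h};

    clEnqueueNDRangeKernel(queue_, kernel, 2, NULL, global_size, NULL, 0, NULL, &event_kernel);

    // Blocking copy of the result back to host memory
    clEnqueueReadBuffer(queue_, buffer_dst_ptr, CL_TRUE, 0, mem_size, dst_y, 0, NULL, NULL);

}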

My kernel is very simple and does minimal computation; you can think of it as just copying the image from source to destination. Because of power constraints, I have to do this on the GPU. I know there is memory transfer overhead, but I have been unable to find out how to reduce it.

  • Hi Tarun,

    As Pete pointed out, cache maintenance could be a factor, but before we jump to blaming that, I'd like to understand a few things:

    1. Is the 10 ms for an entire iteration of your "while (recording)" loop?

    2. How big a buffer are we talking about?

    3. How much time is spent in the kernel alone? I suggest using OpenCL queue profiling to determine that: call clGetEventProfilingInfo on event_kernel, and CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_START will give the total time spent on the device.

    Also, clEnqueueWriteBuffer and clEnqueueReadBuffer always perform a copy. Could the application use clEnqueueMapBuffer instead?

    Regards,

    Kévin
