Buffer create taking 10 ms on Mali-G72

Hi,

I am working on a video processing pipeline in which I have to pass a source image to the GPU, run a computation, and write the result to a destination. I read that creating buffers inside the loop adds GPU overhead on every iteration, so I implemented the following. However, I am still facing the same performance issue. Can someone help me?

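// Create the source and destination buffers once, outside the loop
cl_mem buffer_ptr = clCreateBuffer(context_, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, mem_size, NULL, &error_code);
cl_mem buffer_dst_ptr = clCreateBuffer(context_, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, mem_size, NULL, &error_code);

while (recording) {
    // Blocking copy of the source frame into the source buffer
    clEnqueueWriteBuffer(queue_, buffer_ptr, CL_TRUE, 0, mem_size, src_ptr, 0, NULL, NULL);

    // One work-item per destination pixel
    size_t global_size[2] = {dst_w, dst_h};
    clEnqueueNDRangeKernel(queue_, kernel, 2, NULL, global_size, NULL, 0, NULL, &event_kernel);

    // Blocking copy of the result back to host memory
    clEnqueueReadBuffer(queue_, buffer_dst_ptr, CL_TRUE, 0, mem_size, dst_y, 0, NULL, NULL);
}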

My kernel is very simple and does minimal computation; you can think of it as just copying the image from source to destination. Because of power constraints, I have to do this on the GPU only. I know there is memory transfer overhead, but I have been unable to find out how to reduce it.

0 Tarun Annapareddy over 4 years ago

By "buffer create" in the title, I mean that all the steps involved in transferring data to the kernel together take 10 ms.

+1 Peter Harris over 4 years ago

Hi Tarun,

Firstly, why do you think the GPU is going to be lower power than the CPU for a simple memory copy operation? That seems like a big assumption which is unlikely to be true in practice; the DRAM access energy is going to be the most expensive aspect of that and will dwarf any logic energy cost in the CPU.

The main problem with using the GPU for this is that you are likely using memory that is cached on the CPU to back the buffers. On devices without hardware CPU-to-GPU memory coherency (few do), the drivers will have to do manual cache maintenance when passing a buffer to the GPU (clean) and when reading back (invalidate) the result. Manual set-way cache maintenance is never fast for large buffers.

Cheers,
Pete

0 Tarun Annapareddy over 4 years ago in reply to Peter Harris

Hi Peter,

I understand your suggestion and will test the CPU power consumption. I have one more question: does the Mali-G72 support cl_arm_import_memory efficiently? When I tried it in place of clCreateBuffer, it took around the same time. As I understand it, cl_arm_import_memory maps the data instead of copying it to the device, so how can mapping and copying take the same time? Is it the actual time or some kind of GPU overhead, and how can I overcome it?

0 Peter Harris over 4 years ago in reply to Tarun Annapareddy

> is it the actual time or some kind of GPU overhead and how to overcome those?

As I said in my first post, it is likely that the majority of the time is going to be related to cache maintenance, synchronizing the CPU cache and main memory. This will have to happen for any memory that is cached on the CPU if your system-on-a-chip doesn't support hardware cache coherency.

0 Tarun Annapareddy over 4 years ago in reply to Peter Harris

Hi Peter,

I am working on the Exynos 9611. Is there a way to check whether it provides cache coherency? I want to confirm that before moving forward on that assumption. Can you also comment on my question regarding cl_arm_import_memory?

0 Kévin Petit over 4 years ago

Hi Tarun,

As Pete pointed out, cache maintenance could be a factor, but before we jump to blaming that I'd like to understand a few things:

1. Is the 10ms for an entire iteration of your "while (recording)" loop?

2. How big a buffer are we talking about?

3. How much time is spent in the kernel alone? I suggest using OpenCL queue profiling to determine that: call clGetEventProfilingInfo on event_kernel, and CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_START will give the total time spent on the device.
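
As a rough sketch of the measurement in point 3 (not code from the thread): OpenCL reports profiling timestamps in nanoseconds, so the difference has to be scaled to get milliseconds. The names elapsed_ms and kernel_time_ms are hypothetical, and the USE_OPENCL macro is an assumption of this sketch that keeps the OpenCL part opt-in, so the arithmetic still compiles on a machine without an OpenCL SDK. It also assumes queue_ was created with CL_QUEUE_PROFILING_ENABLE.

```c
#include <assert.h>
#include <stdint.h>

#ifdef USE_OPENCL            /* assumption: opt-in macro, needs OpenCL headers and library */
#include <CL/cl.h>
#endif

/* Convert two profiling timestamps (nanoseconds) into elapsed milliseconds. */
double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) / 1.0e6;
}

#ifdef USE_OPENCL
/* Device-side execution time of one kernel launch, in milliseconds.
 * Requires a queue created with CL_QUEUE_PROFILING_ENABLE.
 * Returns a negative value if any OpenCL call fails. */
double kernel_time_ms(cl_event event_kernel)
{
    cl_ulong start = 0, end = 0;

    /* Profiling data is only valid once the command has completed. */
    if (clWaitForEvents(1, &event_kernel) != CL_SUCCESS)
        return -1.0;
    if (clGetEventProfilingInfo(event_kernel, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL) != CL_SUCCESS)
        return -1.0;
    if (clGetEventProfilingInfo(event_kernel, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL) != CL_SUCCESS)
        return -1.0;

    return elapsed_ms(start, end);
}
#endif
```

Calling kernel_time_ms(event_kernel) right after clEnqueueNDRangeKernel in the loop would give a device-only figure to compare against the wall-clock time of the whole iteration.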

Also, clEnqueueWriteBuffer and clEnqueueReadBuffer always perform a copy. Could the application use clEnqueueMapBuffer instead?
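
A minimal sketch of the mapped alternative, assuming the buffer was created with CL_MEM_ALLOC_HOST_PTR as in the original post. The idea is that whatever produces the frame writes straight into the mapped pointer, so no separate staging copy is needed; whether the mapping is actually copy-free depends on the driver and platform. produce_frame and upload_frame_mapped are hypothetical names, and USE_OPENCL is an assumption of this sketch that makes the OpenCL part opt-in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef USE_OPENCL            /* assumption: opt-in macro, needs OpenCL headers and library */
#include <CL/cl.h>
#endif

/* Hypothetical frame producer: stands in for a decoder or camera callback
 * that can write its output to any destination pointer. Here it just fills
 * the frame with one value. */
void produce_frame(unsigned char *dst, size_t size, unsigned char value)
{
    assert(dst != NULL);
    memset(dst, value, size);
}

#ifdef USE_OPENCL
/* Map the buffer, let the producer write into it directly, then unmap. */
cl_int upload_frame_mapped(cl_command_queue queue, cl_mem buf,
                           size_t size, unsigned char value)
{
    cl_int err = CL_SUCCESS;

    /* Blocking map; WRITE_INVALIDATE_REGION (OpenCL 1.2+) tells the driver
     * the old contents do not need to be preserved. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                 CL_MAP_WRITE_INVALIDATE_REGION,
                                 0, size, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return err;

    produce_frame((unsigned char *)p, size, value);

    /* Unmapping hands the buffer back to the device before the kernel runs. */
    return clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
#endif
```

If the frame already sits in separate host memory, a memcpy into the mapped region is still one copy, so the benefit comes only when the producer can target the mapped pointer itself.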

Regards,

Kévin

0 Tarun Annapareddy over 4 years ago in reply to Kévin Petit

Hi Kévin,

Thanks for replying.

1. The total execution time of the loop is 21 ms, and the execution time of the kernel is 10 ms.

2. The buffer size is around 2800x1600, and I have to create one buffer each for the source and the destination.

3. I can use clEnqueueMapBuffer, but I did not find any difference. In that case I have to memcpy the image from the source pointer into the mapped buffer region, right? So the copy will always be there.
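
To put those numbers in perspective, here is a back-of-the-envelope sketch. It assumes 1 byte per pixel (for example a single luma plane; the thread does not state the pixel format) and that the 11 ms left after the 10 ms kernel is spent in the two blocking copies, which is a simplification. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one frame in bytes. */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    assert(width > 0 && height > 0 && bytes_per_pixel > 0);
    return width * height * bytes_per_pixel;
}

/* Effective rate in MB/s (1 MB = 1e6 bytes) when `copies` frames of
 * `bytes` each are moved in `ms` milliseconds. */
double copy_rate_mb_per_s(size_t bytes, unsigned copies, double ms)
{
    assert(ms > 0.0);
    return ((double)bytes * copies) / (ms * 1.0e-3) / 1.0e6;
}
```

Under those assumptions, frame_bytes(2800, 1600, 1) is 4,480,000 bytes, and copy_rate_mb_per_s(4480000, 2, 11.0) works out to roughly 815 MB/s for the write plus the read-back.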
