I am trying to implement a frame-copy kernel. I have a pointer to an image that I need to copy to the location given by a destination pointer. I could do this on the CPU, which gives the best performance, but because of power requirements I am doing it on the GPU.
CPU time: 2 ms, GPU time: 24 ms
Please review this GPU code and help me optimize it.
// Create buffer
// Repeat the code below to create buffer variables for source and destination
mem_flag |= CL_MEM_USE_HOST_PTR;
buffers = clCreateBuffer(context_, mem_flag, mem_size, host_ptr, &error_code);

size_t global_size[2] = { (size_t) dst_w / 8, (size_t) dst_h };
int ret = clEnqueueNDRangeKernel(queue_, kernel, 2, NULL, global_size, NULL, 0, NULL, &event_kernel);
clFinish(queue_);

// Kernel code
// buf_src_y:  buffer pointer to source image
// buf_dst_y:  buffer pointer to destination image
// buf_src_uv: buf_src + src_uv_offset
// buf_dst_uv: buf_dst + dst_uv_offset
int x = get_global_id(0) * 8;
int y = get_global_id(1);
int src_pos = mad24(y, src_stride, x);
int dst_pos = mad24(y, dst_stride, x);
vstore8(vload8(0, buf_src_y + src_pos), 0, buf_dst_y + dst_pos);
if (y < dst_uv_h) {
    vstore8(vload8(0, buf_src_uv + src_pos), 0, buf_dst_uv + dst_pos);
}
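For comparison, the indexing in the kernel above can be written as a plain C loop. This is only a minimal sketch and makes several assumptions: the buffers hold 8-bit samples, both planes share the same stride, and the names copy_plane_8wide and copy_frame_reference are hypothetical helpers, not part of the posted code. It may be useful for checking the GPU output against a known-good copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy one plane in 8-byte chunks, one chunk per
 * (x / 8, y) position, mirroring the work-item layout of the kernel.
 * As in the kernel launch, width / 8 truncates, so any trailing bytes
 * beyond the last multiple of 8 in a row are not copied. */
static void copy_plane_8wide(const uint8_t *src, size_t src_stride,
                             uint8_t *dst, size_t dst_stride,
                             size_t width, size_t height)
{
    assert(src_stride >= width && dst_stride >= width);
    for (size_t y = 0; y < height; ++y) {
        for (size_t gx = 0; gx < width / 8; ++gx) {
            size_t x = gx * 8;                 /* get_global_id(0) * 8 */
            size_t src_pos = y * src_stride + x;
            size_t dst_pos = y * dst_stride + x;
            memcpy(dst + dst_pos, src + src_pos, 8);  /* vload8 + vstore8 */
        }
    }
}

/* Hypothetical reference copy: luma rows 0..dst_h-1, chroma rows only
 * where y < dst_uv_h (and y < dst_h, since the kernel only runs for
 * those rows). Assumes 8-bit samples and a shared stride per image. */
void copy_frame_reference(const uint8_t *src_y, const uint8_t *src_uv,
                          size_t src_stride,
                          uint8_t *dst_y, uint8_t *dst_uv,
                          size_t dst_stride,
                          size_t dst_w, size_t dst_h, size_t dst_uv_h)
{
    size_t uv_rows = dst_uv_h < dst_h ? dst_uv_h : dst_h;
    copy_plane_8wide(src_y, src_stride, dst_y, dst_stride, dst_w, dst_h);
    copy_plane_8wide(src_uv, src_stride, dst_uv, dst_stride, dst_w, uv_rows);
}
```

Running this on the same input and comparing the result with the mapped GPU destination buffer (for example with memcmp) would confirm that the kernel copies what is expected before any tuning is attempted.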
Hi,
You're more likely to get useful help if you post the complete code.
What makes you think the GPU is more power efficient when it comes to memory copies?
Regards,
Kévin
Hello. Your code is not complete, so we cannot help you yet. Once you post the complete code, you will receive more comments.