I am trying to implement a frame-copy kernel. I have a pointer to an image that I need to copy to the location given by a destination pointer. I could do this on the CPU, which gives the best performance, but because of power requirements I am doing it on the GPU.
CPU time: 2 ms
GPU time: 24 ms
Please review this GPU code and help me optimize it.
// Create buffer
// Repeat the code below to create buffer variables for source and destination
mem_flag |= CL_MEM_USE_HOST_PTR;
buffers = clCreateBuffer(context_, mem_flag, mem_size, host_ptr, &error_code);

global_size[2] = { (size_t) dst_w/8, (size_t) dst_h};
int ret = clEnqueueNDRangeKernel(queue_, kernel, 2, NULL, global_size, NULL, 0, NULL, &event_kernel);
clFinish(queue_);

// Kernel code
// buf_src_y:  buffer pointer to source image
// buf_dst_y:  buffer pointer to destination image
// buf_src_uv: buf_src + src_uv_offset
// buf_dst_uv: buf_dst + dst_uv_offset
int x = get_global_id(0) * 8;
int y = get_global_id(1);
int src_pos = mad24(y, src_stride, x);
int dst_pos = mad24(y, dst_stride, x);
vstore8(vload8(0, buf_src_y + src_pos), 0, buf_dst_y + dst_pos);
if (y < dst_uv_h) {
    vstore8(vload8(0, buf_src_uv + src_pos), 0, buf_dst_uv + dst_pos);
}
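For comparison, the copy that the kernel performs can be written as a CPU reference in plain C, which is handy for checking the GPU output byte for byte. This is only a minimal sketch under assumptions: the UV plane is taken to share the Y plane's stride and row width (as the kernel's reuse of src_pos and dst_pos suggests), all sizes are in bytes, and the names copy_plane and copy_frame are hypothetical, not taken from the code above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy one image plane row by row, honoring
 * separate source and destination strides (both in bytes). */
static void copy_plane(const uint8_t *src, size_t src_stride,
                       uint8_t *dst, size_t dst_stride,
                       size_t width_bytes, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        memcpy(dst + y * dst_stride, src + y * src_stride, width_bytes);
    }
}

/* Hypothetical helper: copy a frame made of a Y plane (height rows)
 * and a UV plane (uv_height rows) located at the given byte offsets.
 * Assumption: both planes use the same stride and row width. */
static void copy_frame(const uint8_t *src, size_t src_stride, size_t src_uv_offset,
                       uint8_t *dst, size_t dst_stride, size_t dst_uv_offset,
                       size_t width_bytes, size_t height, size_t uv_height)
{
    copy_plane(src, src_stride, dst, dst_stride, width_bytes, height);
    copy_plane(src + src_uv_offset, src_stride,
               dst + dst_uv_offset, dst_stride,
               width_bytes, uv_height);
}
```

Calling copy_frame with the same strides and offsets that are passed to the kernel, then comparing the result against the GPU destination buffer with memcmp, is one way to confirm the kernel is correct before tuning it.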
Hi,
You're more likely to get useful help if you post the complete code.
What makes you think the GPU is more power efficient when it comes to memory copies?
Regards,
Kévin