Copy-frame kernel is slower on Mali GPU than on CPU

I am trying to implement a copy-frame kernel: given a pointer to a source image, it copies the image to the location given by a destination pointer. Doing this on the CPU gives me the best performance, but because of power requirements I am doing it on the GPU.
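For comparison, the CPU version of such a copy can be sketched as a row-by-row `memcpy`. This is only an illustration under assumptions, not the exact code that was measured: it assumes an NV12-style layout (a Y plane followed by a UV plane that shares the Y stride), and the names `copy_plane` and `copy_frame_cpu` and their parameter lists are hypothetical. The parameter names mirror the variables used in the kernel further down.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `rows` rows of `width` bytes between two
 * planes that may have different strides (bytes per row in memory). */
static void copy_plane(const uint8_t *src, size_t src_stride,
                       uint8_t *dst, size_t dst_stride,
                       size_t width, size_t rows)
{
    for (size_t y = 0; y < rows; ++y) {
        memcpy(dst + y * dst_stride, src + y * src_stride, width);
    }
}

/* Hypothetical CPU reference for the frame copy.
 * Assumption: the UV plane starts at *_uv_offset bytes from the frame
 * start, uses the same stride as the Y plane, and has dst_uv_h rows. */
static void copy_frame_cpu(const uint8_t *src, size_t src_stride, size_t src_uv_offset,
                           uint8_t *dst, size_t dst_stride, size_t dst_uv_offset,
                           size_t dst_w, size_t dst_h, size_t dst_uv_h)
{
    /* Y plane: dst_h rows of dst_w bytes. */
    copy_plane(src, src_stride, dst, dst_stride, dst_w, dst_h);

    /* UV plane: dst_uv_h rows of dst_w bytes. */
    copy_plane(src + src_uv_offset, src_stride,
               dst + dst_uv_offset, dst_stride, dst_w, dst_uv_h);
}
```

A reference like this is also handy for checking the GPU output byte for byte after the kernel has run.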

CPU time: 2 ms, GPU time: 24 ms

Please review the GPU code below and help me optimize it.

// Create buffers

// Repeat the code below to create the buffer variables for the source and the destination

mem_flag |= CL_MEM_USE_HOST_PTR;
buffers = clCreateBuffer(context_, mem_flag, mem_size, host_ptr, &error_code);

// Each work-item copies 8 elements of one row, hence dst_w / 8 work-items in x
size_t global_size[2] = { (size_t) dst_w / 8, (size_t) dst_h };

int ret = clEnqueueNDRangeKernel(queue_, kernel, 2, NULL, global_size, NULL, 0, NULL, &event_kernel);

// Kernel code

// buf_src_y: buffer pointer to the source image; buf_dst_y: buffer pointer to the destination image
// buf_src_uv: buf_src + src_uv_offset; buf_dst_uv: buf_dst + dst_uv_offset

int x = get_global_id(0) * 8;
int y = get_global_id(1);

int src_pos = mad24(y, src_stride, x);
int dst_pos = mad24(y, dst_stride, x);
vstore8(vload8(0, buf_src_y + src_pos), 0, buf_dst_y + dst_pos);

if (y < dst_uv_h) {
    vstore8(vload8(0, buf_src_uv + src_pos), 0, buf_dst_uv + dst_pos);
}
