Hi everyone,
I have recently been working on a GPU application that will run on an Arndale board and use its Mali GPU. To speed up execution I wanted to optimize memory usage. According to the OpenCL guide, CL_MEM_ALLOC_HOST_PTR should be used to improve performance, and CL_MEM_USE_HOST_PTR is discouraged.
However, in my experiments CL_MEM_USE_HOST_PTR actually reduces data transfer time but increases kernel execution overhead. I also found that a data copy is unavoidable in both cases (CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR).
Can anyone confirm? Is it possible at all to have a zero copy?
The Mali OpenCL guide says that using CL_MEM_ALLOC_HOST_PTR requires no copy, but there is a copy. Let's say I have a pointer A and I create a buffer using CL_MEM_ALLOC_HOST_PTR. To make the data behind A available to the GPU, I have to memcpy it from A into the space allocated through CL_MEM_ALLOC_HOST_PTR.
So, data copy is needed. Is there a way to access the data directly from GPU without any copying?
PS: I have attached my code for your feedback.
UPDATE: I have uploaded a version with CL_MEM_ALLOC_HOST_PTR for your review.
This is the code snippet:
#endif
The second reason it's slow is because some initialisation operations are deferred to the first time an object is actually used.
Also, in a real-life application you would allocate your buffers once and then map/unmap them at every frame, so if you want a realistic test case you should do something like:
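createBuffer();              /* allocate once, outside the timed loop */
for (int i = 0; i < 100; i++) {
    timer_start();
    map();                   /* get a host pointer to the buffer */
    fill_buffer();           /* write the data through that pointer */
    unmap();                 /* hand the buffer back to the GPU */
    enqueue_kernel();
    finish();                /* wait for the kernel to complete */
    timer_end();
}
releaseBuffer();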
When doing that you should observe that the first iteration will take more time because of what I explained above, then all the following iterations should be much faster.
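If it helps, here is a rough sketch in plain C of how the timing in that loop could be recorded per iteration, so the slow first pass can be reported separately from the steady-state ones. The names time_frames, frame_fn and placeholder_frame are made up for this example and are not part of OpenCL or any Mali SDK. It also assumes a POSIX system (clock_gettime), which should hold for Linux on the Arndale board.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical callback type: one "frame" of work. In a real test this is
   where you would map the buffer, fill it, unmap it, enqueue the kernel
   and wait for it to finish. */
typedef void (*frame_fn)(void *user);

/* Wall-clock time in milliseconds from a monotonic clock. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Runs frame() iters times and stores each iteration's duration in
   out_ms[0..iters-1]. Returns the mean of iterations 1..iters-1, so the
   first (warm-up) iteration is excluded. Returns -1.0 if iters < 2. */
double time_frames(frame_fn frame, void *user, int iters, double *out_ms)
{
    if (iters < 2)
        return -1.0;

    double steady_sum = 0.0;
    for (int i = 0; i < iters; i++) {
        double t0 = now_ms();
        frame(user);
        out_ms[i] = now_ms() - t0;
        if (i > 0)
            steady_sum += out_ms[i];
    }
    return steady_sum / (iters - 1);
}

/* Placeholder frame so the sketch compiles on its own: it only counts
   how many times it was called. Replace it with the real per-frame work. */
void placeholder_frame(void *user)
{
    int *calls = (int *)user;
    (*calls)++;
}
```

You would call it as time_frames(my_frame, &my_state, 100, durations), then compare durations[0] against the returned mean. Keeping the buffer creation and release outside the callback matches the loop above, where only the map/fill/unmap/kernel steps are timed.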