Hi,
Is there any way to speed up copying data from CPU buffers allocated with malloc() to GPU-accessible memory? Currently I am using a plain memcpy() to copy the data.
Thanks & Regards,
Narendra Kumar Chepuri.
Hi narendrakumar_chepuri,
It is much faster to share buffers between the CPU and GPU by allocating the memory with clCreateBuffer() rather than malloc(), as it can then be accessed natively by both without the need for a copy.
Please see the memory buffers tutorial in the Mali OpenCL SDK (Mali OpenCL SDK v1.1.0: Memory Buffers) for full details. The SDK also covers a range of other OpenCL topics.
Hope this helps,
Rich
Hi Rich,
You suggested using clCreateBuffer(), but the problem is that the data is spread across different regions. Suppose my kernel input data is denoted by the letter X, and the remaining data, which the kernel does not need (but which must still be present in the final combined CPU+GPU output), is denoted by O:
O X O O O X X X O O O O O O O O O O
X X X O O O O O O O X O O O O O X O
X X X O O O O O O O X O O X O O X O
O X O O O X O X O O X O O X O O X O
X O O X O O O O X O O O X O O X O O
O O O O O X O O O O O O O O O X O X
X O X O O O O X O O O O O O O O O O
O O O O O O O O X O O O X O X O X O
X O X O X O O O O O X O O O O X O O
X O O X O O O X O O O O X O X O O O
O X O O O X O O O O O O O X O O X O
There are multiple blocks like this. We have already implemented this using clCreateBuffer() only, but because the data is accessed at random locations, we observed the kernel execution time increase from 2 msec to 70 msec. I have already posted about this issue and haven't received a proper explanation. Please refer to my post in this group, "ARM Mali-T628(Samsung Exynos Octa 5420 Board) GPU Cache issue in kernel". A suggestion for solving the cache issue would also be fine for me.
Narendra Kumar.
Hi Narendra,
It's important for the data you're using to be contiguous, both to optimise use of the cache and because GPUs load 128 bits of data at a time. If you access sparse data, you're wasting a lot of the bandwidth.
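One possible way to get contiguous input when the scattered layout itself cannot change is to gather the needed elements into a packed buffer before the kernel runs and scatter the results back afterwards. The sketch below is only an illustration under assumptions: it assumes float elements and a precomputed index list of the X positions, and the helper names gather and scatter are hypothetical. It also adds a CPU-side copy, so it only pays off if the kernel speed-up outweighs that cost.

```c
#include <stddef.h>

/* Hypothetical helper: copy the elements the kernel needs (the X cells)
 * into one contiguous buffer. idx[i] is the position of the i-th needed
 * element in the original array. */
static void gather(const float *src, const size_t *idx, size_t n, float *packed)
{
    for (size_t i = 0; i < n; ++i)
        packed[i] = src[idx[i]];
}

/* Hypothetical helper: write processed values back to their original
 * positions. Elements not listed in idx (the O cells) are left untouched. */
static void scatter(const float *packed, const size_t *idx, size_t n, float *dst)
{
    for (size_t i = 0; i < n; ++i)
        dst[idx[i]] = packed[i];
}
```

With this arrangement the kernel would read packed[get_global_id(0)], so neighbouring work-items touch neighbouring addresses.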
If possible, try to switch from an array of structures to a structure of arrays for your data organisation. It should help.
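As a rough illustration of the difference between the two layouts, here is a small sketch. The type names, field names and NUM_POINTS are made up for the example and are not taken from your code.

```c
#include <stddef.h>

#define NUM_POINTS 4 /* illustrative element count (assumption) */

/* Array of structures: consecutive x values are separated by y and z,
 * so a kernel that reads only x strides through memory. */
struct PointAoS { float x, y, z; };

/* Structure of arrays: all x values are adjacent, so a kernel that reads
 * only x touches one contiguous region. */
struct PointsSoA {
    float x[NUM_POINTS];
    float y[NUM_POINTS];
    float z[NUM_POINTS];
};

/* Convert from the AoS layout to the SoA layout. */
static void aos_to_soa(const struct PointAoS *in, struct PointsSoA *out)
{
    for (size_t i = 0; i < NUM_POINTS; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```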
Thanks,
Anthony
Hi Anthony,
Thanks for your response, but I am not using any structures to store the input data. Is there any other way to solve these cache issues?
Note: I just pass the source pointer as a kernel argument, and the kernel uses the global ID to access the source data at a particular location.
Would you mind sharing some details on how the timing is obtained? If you are using clGetEventProfilingInfo(), please include the parameters you used for the measurements as well.
Neil
Hi neiltan,
I am using the gettimeofday() function, which is a prebuilt C library function, for profiling.
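For reference, a minimal sketch of this kind of gettimeofday() measurement is shown below. It is an illustration under assumptions, not the exact code used: the names elapsed_ms, time_call_ms and sample_work are hypothetical. One point worth keeping in mind is that OpenCL enqueue calls return before the work completes, so a wall-clock measurement like this only covers the kernel if the timed function blocks until it has finished (for example by calling clFinish()).

```c
#include <stddef.h>
#include <sys/time.h>

/* Elapsed wall-clock time between two gettimeofday() samples, in ms. */
static double elapsed_ms(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0
         + (end.tv_usec - start.tv_usec) / 1000.0;
}

/* Time one call of a caller-supplied function. For an OpenCL kernel,
 * work() would have to enqueue the kernel AND wait for it to finish. */
static double time_call_ms(void (*work)(void *), void *arg)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    work(arg);
    gettimeofday(&t1, NULL);
    return elapsed_ms(t0, t1);
}

/* Placeholder workload (assumption): increments the int that arg points to. */
static void sample_work(void *arg)
{
    int *counter = arg;
    *counter += 1;
}
```

A call such as time_call_ms(sample_work, &counter) returns the elapsed time in milliseconds for one run of the workload.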
Narendra kumar.