CPU to GPU Copying Speed tuning.

Hi,

Is there any way to speed up copying data from CPU buffers allocated with malloc() to GPU-accessible memory? Currently I am using a plain memcpy() to copy the data.

Thanks & Regards,

Narendra Kumar Chepuri.

  • Hi Rich,

    You suggested using clCreateBuffer(), but the problem here is that the data is spread across different regions. Suppose my kernel input data is denoted by the letter X, and the remaining data, which the kernel does not need (but which must still be present in the final combined CPU+GPU output), is denoted by O.

    O X O O O X X X O O O O O O O O O O

    X X X O O O O O O O X O O O O O X O

    X X X O O O O O O O X O O X O O X O

    O X O O O X O X O O X O O X O O X O

    X O O X O O O O X O O O X O O X O O

    O O O O O X O O O O O O O O O X O X

    X O X O O O O X O O O O O O O O O O

    O O O O O O O O X O O O X O X O X O

    X O X O X O O O O O X O O O O X O O

    X O O X O O O X O O O O X O X O O O

    O X O O O X O O O O O O O X O O X O

    There will be multiple blocks like this. We have implemented it using clCreateBuffer() only, but because the data is accessed at random locations, the kernel execution time increases from 2 msec to 70 msec. I have already posted about this issue and have not received a proper explanation. Please refer to my post in this group, "ARM Mali-T628(Samsung Exynos Octa 5420 Board) GPU Cache issue in kernel". A suggested solution for the cache issue would also be fine for me.

    Thanks & Regards,

    Narendra Kumar.

  • Hi Narendra,

    It's important to keep the data you're using contiguous, both to optimise use of the cache and because GPUs load 128 bits of data at a time; if you access sparse data, you waste a lot of the bandwidth.
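
    As a rough illustration of making scattered data contiguous, the sketch below packs the wanted elements (the X entries) into a dense staging array on the CPU and writes results back to their original positions afterwards. Everything in it is an assumption made for the example: the float element type, the idx[] list of X positions, and the helper names gather() and scatter() do not come from the original code. The packing pass has its own CPU cost, so whether it pays off would need measuring.

```c
#include <stddef.h>

/* Hypothetical helper: copy the elements whose positions are listed in
 * idx[] out of the sparse source array into a dense (contiguous) array.
 * The float element type is an assumption for this sketch. */
void gather(const float *src, const size_t *idx, size_t count, float *packed)
{
    for (size_t i = 0; i < count; ++i)
        packed[i] = src[idx[i]];
}

/* Hypothetical helper: write the dense results back to their original
 * positions, leaving every other element of dst untouched. */
void scatter(const float *packed, const size_t *idx, size_t count, float *dst)
{
    for (size_t i = 0; i < count; ++i)
        dst[idx[i]] = packed[i];
}
```

    The dense array is what would be handed to the kernel, so that neighbouring work-items read neighbouring addresses; the O elements never leave the CPU side.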

    If possible, try switching your data organisation from an array of structures to a structure of arrays; it should help.
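
    To make the two layouts concrete, here is a minimal sketch of converting one to the other. The three-field pixel type and all names in it (PixelAoS, PixelsSoA, aos_to_soa) are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Array of structures: the fields of one element sit next to each other,
 * so reading only `r` for every element strides over `g` and `b`.
 * This layout is a hypothetical example. */
struct PixelAoS {
    float r, g, b;
};

/* Structure of arrays: each field gets its own contiguous array, so
 * consecutive elements of one field occupy consecutive addresses. */
struct PixelsSoA {
    float *r;
    float *g;
    float *b;
};

/* Copy n elements from the AoS layout into caller-allocated SoA arrays. */
void aos_to_soa(const struct PixelAoS *in, size_t n, struct PixelsSoA *out)
{
    for (size_t i = 0; i < n; ++i) {
        out->r[i] = in[i].r;
        out->g[i] = in[i].g;
        out->b[i] = in[i].b;
    }
}
```

    With the second layout, a kernel that only needs one field can read it with unit stride.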

    Thanks,

    Anthony

  • Hi Anthony,

    Thanks for your response, but I am not using any structures to store the input data. Is there any other way to solve these cache issues?

    Note: I just pass the source pointer as a kernel argument, and in the kernel I use the global ID to access the source data at a particular location.

    Thanks & Regards,

    Narendra Kumar.

  • Hi Narendra,

    Would you mind sharing some details on how the timing was obtained? If you are using clGetEventProfilingInfo(), please also include the parameters you used for the measurements.

    Thanks,

    Neil

  • Hi neiltan,

    I am using the gettimeofday() function, a standard library function callable from C, for profiling.
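
    For reference, a minimal sketch of this kind of wall-clock measurement is shown below. The helper names elapsed_ms() and time_call_ms() and the callback signature are hypothetical, not the actual code. One caveat that applies to any host-side timer: OpenCL enqueue calls return before the kernel has finished, so the timed region would have to include a wait such as clFinish() for the figure to reflect execution time.

```c
#include <stddef.h>
#include <sys/time.h>

/* Difference between two gettimeofday() samples, in milliseconds. */
double elapsed_ms(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000.0
         + (end->tv_usec - start->tv_usec) / 1000.0;
}

/* Hypothetical wrapper: time a single call of work(arg).
 * For an OpenCL kernel, work() is assumed to enqueue the kernel and then
 * block until it completes (for example with clFinish()); otherwise only
 * the enqueue overhead would be measured. */
double time_call_ms(void (*work)(void *), void *arg)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    work(arg);
    gettimeofday(&t1, NULL);

    return elapsed_ms(&t0, &t1);
}
```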

    Thanks & Regards,

    Narendra kumar.