
CPU to GPU Copying Speed tuning.

Hi,

Is there any way to speed up copying data from CPU buffers allocated with malloc() to GPU-accessible memory? Currently I am using a plain memcpy() to copy the data.

Thanks & Regards,

Narendra Kumar Chepuri.

  • Hi Rich,

    You suggested using clCreateBuffer(), but the problem is that the data is spread across different regions. Suppose my kernel input data is denoted by the letter X, and the remaining data, which the kernel does not need (but which must still be present in the final combined CPU+GPU output), is denoted by O.

    O X O O O X X X O O O O O O O O O O

    X X X O O O O O O O X O O O O O X O

    X X X O O O O O O O X O O X O O X O

    O X O O O X O X O O X O O X O O X O

    X O O X O O O O X O O O X O O X O O

    O O O O O X O O O O O O O O O X O X

    X O X O O O O X O O O O O O O O O O

    O O O O O O O O X O O O X O X O X O

    X O X O X O O O O O X O O O O X O O

    X O O X O O O X O O O O X O X O O O

    O X O O O X O O O O O O O X O O X O

    There will be multiple blocks like this, so we implemented this using clCreateBuffer() only. However, because the data is accessed at random locations, we see the kernel execution time increase from 2 msec to 70 msec. I have already posted about this issue and have not received a proper explanation. Please refer to my post in this group, "ARM Mali-T628(Samsung Exynos Octa 5420 Board) GPU Cache issue in kernel". Even a suggested solution for the cache issue alone would be fine for me.
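To make the layout above concrete, here is a minimal host-side sketch in plain C, not the code from our application. It assumes a per-element mask that marks the X entries, gathers them into one contiguous staging buffer that the kernel could read linearly, and scatters the results back afterwards. The helper names gather_marked and scatter_marked, the float element type, and the mask representation are all hypothetical, and whether the extra packing pass pays off on a given device would have to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy every element flagged in `mask` (the X entries)
 * from `src` into the contiguous buffer `packed`, and record each element's
 * original position in `index`. Returns the number of elements packed.
 * `packed` and `index` must have room for up to `n` entries. */
static size_t gather_marked(const float *src, const unsigned char *mask,
                            size_t n, float *packed, size_t *index)
{
    assert(src && mask && packed && index);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (mask[i]) {
            packed[count] = src[i];
            index[count] = i;
            ++count;
        }
    }
    return count;
}

/* Hypothetical helper: write the processed values back to their original
 * positions, leaving the O entries of `dst` untouched. */
static void scatter_marked(float *dst, const float *packed,
                           const size_t *index, size_t count)
{
    assert(dst && packed && index);
    for (size_t k = 0; k < count; ++k) {
        dst[index[k]] = packed[k];
    }
}
```

For the first row of the diagram (O X O O O X X X O ...), the mask would flag positions 1, 5, 6 and 7, so only four elements would be packed, handed to the kernel as one contiguous run, and then written back in place.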

    Thanks & Regards,

    Narendra Kumar.
