Hello, I am working on a Rockchip platform (RK3568) with a Mali-G52 GPU, and I am trying to understand the processing time of my OpenCL code. I simplified my kernel as much as possible so that it only copies a buffer, and I am measuring very high processing times compared to the theoretical values I computed.
Here are the computations I made:
Size of the copied buffer: 14,587,776 bytes
Announced memory bus throughput (LPDDR4-1600): 1600 MHz * 2 (DDR) * 32 bits = 12.8 GB/s
So I get my theoretical value: Image_size / (bus freq * width) = 1.14 ms to read the full buffer.
I doubled that value since I want to read and then write back, so I get 2.28 ms. I have read that the real-world efficiency of such DDR should be around 65-70%.
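For reference, the arithmetic above can be reproduced with a short script (buffer size and bus parameters are the numbers quoted above):

```python
# Theoretical copy time for the buffer, using the numbers above.
buf_bytes = 14_587_776          # size of the copied buffer
bus_mhz = 1600                  # LPDDR4 clock in MHz
ddr_factor = 2                  # double data rate: two transfers per clock
bus_width_bytes = 4             # 32-bit bus

bandwidth = bus_mhz * 1e6 * ddr_factor * bus_width_bytes  # bytes/s
read_ms = buf_bytes / bandwidth * 1e3
copy_ms = 2 * read_ms           # read + write back

print(f"bandwidth: {bandwidth / 1e9:.1f} GB/s")   # 12.8 GB/s
print(f"read: {read_ms:.2f} ms, copy: {copy_ms:.2f} ms")
```

This reproduces the 1.14 ms read / 2.28 ms copy figures, before applying any DDR efficiency factor.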
Now, when I use the OpenCL built-in function 'clEnqueueCopyBuffer', I get a processing time of 6.5 ms, which is already more than double the theoretical time. When I write a copy kernel myself that takes as input two buffers of said size, allocated by the host (CPU), the best case I get is 12.2 ms, using SVM for both buffers.
Here are my kernel's parameters: global_work_size = Img_size/16, with each work-item reading/writing 16 bytes at a time using the vload16/vstore16 functions.
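For context, my copy kernel is essentially equivalent to this sketch (the `src`/`dst` parameter names are just placeholders):

```c
// OpenCL C device code -- minimal 16-byte-per-work-item copy kernel.
// Each work-item moves one uchar16, so global_work_size = buffer_size / 16.
__kernel void copy16(__global const uchar *src, __global uchar *dst)
{
    size_t i = get_global_id(0) * 16;          // byte offset for this work-item
    vstore16(vload16(0, src + i), 0, dst + i); // 16-byte load + store
}
```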
Additionally, I used the following commands to watch the GPU/DDR clocks while I was running my tests (1000 copies in a row):
root@rock-3a:/sys/kernel/debug/clk# cat clk_summary | grep gpu
root@rock-3a:/sys/kernel/debug/clk# cat clk_summary | grep ddr
And I noticed the reported frequencies were never as high as the announced 800 MHz for the GPU and 1600 MHz for the DDR.
So I am wondering: am I missing anything while profiling? Is there a setting to "force" the GPU/DDR frequencies to go as high as possible? Or is there something wrong in my theoretical computations already?
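One thing I am considering trying: on Rockchip kernels the GPU and DDR clocks are usually managed by devfreq governors, which only ramp up under sustained load. Something like the following should pin them to the highest OPP, though the exact sysfs node names are an assumption on my part (the `fde60000.gpu` address may differ per board/kernel, so `ls /sys/class/devfreq/` first):

```shell
# Pin the GPU devfreq governor to its highest frequency.
# Device node name (fde60000.gpu) is board/kernel dependent --
# check `ls /sys/class/devfreq/` for the actual names.
echo performance > /sys/class/devfreq/fde60000.gpu/governor

# DDR frequency is typically handled by the dmc (dynamic memory
# controller) devfreq device, if the kernel exposes one:
echo performance > /sys/class/devfreq/dmc/governor
```

Restoring the original governor (usually `simple_ondemand` or `dmc_ondemand`) afterwards would undo this.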
I also opened a support case, but I figured asking the community might prove useful as well.
Thanks for your time and consideration, Virgile