Hi,
I've gotten most of my code working with the Chromebook development platform. Simple kernels run fine and show a nice performance speedup. However, more complex kernels perform poorly.
As per the suggestions of some members of this forum, I've reorganized one of my OpenCL kernels to avoid using local memory and to use float4 vectors. The routine I was optimizing was a simple covariance calculation for large data sets. Since this is essentially a modified matrix multiply, I used the provided sgemm.cl as an example. This resulted in a pretty substantial speed-up in my code (about a factor of 2-3).
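The reorganized kernel is roughly along these lines (a minimal OpenCL C sketch, not my actual code; the names, the row-major layout, and nObs being divisible by 4 are all assumptions). Each work-item computes one element of C = X*X^T / (n - 1), and because both operands are rows of the centred data matrix, every load is a contiguous float4 and no local memory is used:

__kernel void covariance_float4(const int nObs,
                                __global const float4 *X, /* nVars x nObs, row-major, mean-centred */
                                __global float *C)        /* nVars x nVars output                  */
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    const int w = nObs / 4;              /* row length in float4 elements */

    float4 acc = (float4)(0.0f);
    for (int k = 0; k < w; ++k)
        acc += X[i * w + k] * X[j * w + k];   /* both reads walk along contiguous rows */

    C[i * get_global_size(1) + j] = (acc.x + acc.y + acc.z + acc.w) / (nObs - 1);
}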
Since I can't guarantee that a client's system has a functioning OpenCL implementation, I have a standard CPU version of the same function. Running both of these on the Chromebook I get the following performance figures for a data set:
Using local memory (i.e. optimized for nVidia): 24.5 seconds
Using vectors and no local memory (i.e. optimized for Mali): 9.26 seconds
Using CPUs with calls to BLAS (no OpenCL): 6.99 seconds
So here, despite the optimization effort, it is still quicker to use the CPU to do the processing.
To check whether I was doing something wrong, I looked at the provided sgemm.cpp file. As a comparison, I ran the BLAS CPU version of sgemm (installed using apt-get) against the provided OpenCL example. For the default matrix size (2048x2048), I get the following run times:
OpenCL sgemm: 84 seconds
CPU BLAS sgemm: 3 seconds
So the single-core CPU version is about 30 times faster than the OpenCL version!
I've checked, and the output is the same for both cases.
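For reference, the CPU side of my comparison is just a plain single-precision BLAS matrix multiply, something like the C sketch below (assuming an apt-get BLAS that exposes the cblas interface, e.g. ATLAS or OpenBLAS; the names and fill step are placeholders):

#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int n = 2048;                        /* default matrix size in the example */
    float *A = malloc(sizeof(float) * n * n);
    float *B = malloc(sizeof(float) * n * n);
    float *C = malloc(sizeof(float) * n * n);

    /* ... fill A and B with the same test data used for the OpenCL run ... */

    /* C = 1.0 * A * B + 0.0 * C, single precision, row-major */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);

    free(A); free(B); free(C);
    return 0;
}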
I suspect that performance in the OpenCL code is being choked by the sheer number of loads from global memory. On systems with local memory, these loads can be cached, so performance doesn't suffer as badly. I assume the CPU caches the data fairly well, which speeds things up.
If this is indeed the case, performance will be bad on anything other than extremely simple kernels.
Am I making some sort of mistake here? Is there anything I can do to mitigate this? Essentially 90% of the processing time in most of my stuff is a matrix multiply.
Thanks for all of your help.
--Mike
> The transpose could be done on the CPU, handing over to the GPU when done - or with some thought possibly even part-way through
Which is what I'm now doing. I'll do two variants, one doing the transpose on the CPU and one on the GPU (I expect the transpose itself to be faster on the CPU due to the cache considerations listed above), and see what numbers I get. I could move on from there to overlapping the transposition and multiplication work to reduce the latency of the output. On paper, post-transposition, the multiplication work should be vastly faster than it is currently.
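For the CPU-side variant I'm picturing something simple and cache-blocked along these lines (a rough C sketch, not finished code; the block size and row-major layout are assumptions):

#define BLOCK 32

static void transpose_blocked(const float *B, float *B_T, int rows, int cols)
{
    /* Walk the matrix in BLOCK x BLOCK tiles so both the reads from B and
     * the writes to B_T stay within a small, cache-resident working set. */
    for (int ib = 0; ib < rows; ib += BLOCK)
        for (int jb = 0; jb < cols; jb += BLOCK)
            for (int i = ib; i < ib + BLOCK && i < rows; ++i)
                for (int j = jb; j < jb + BLOCK && j < cols; ++j)
                    B_T[j * rows + i] = B[i * cols + j];
}

The multiplication kernel can then read both A and the transposed B row-wise, so every float4 load is contiguous and fully used.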
> It could be that this particular algorithm is better suited to the CPU.
This particular implementation certainly is due to the cache-unfriendly manner in which it loads data.
Chris