Performance and SGEMM example

Hi,

I've gotten most of my code working with the Chromebook development platform.  Simple kernels run fine and show a nice performance speedup.  However, more complex kernels perform poorly.

As per the suggestions of some members of this forum, I've reorganized one of my OpenCL kernels to avoid using local memory and to use the float4 vectors.  The routine I was optimizing was a simple covariance calculation for large data sets.  Since this is essentially a modified matrix multiply, I used the provided sgemm.cl as an example.  This resulted in a pretty substantial speed up in my code (about a factor of 2-3).

Since I can't guarantee that a client's system has a functional OpenCL system, I have a standard CPU version of the same function.  Running both of these on the Chromebook I get the following performance figures for a data set:

Using local memory (i.e. optimized for nVidia):   24.5 seconds

Using vectors and no local memory (i.e. optimized for Mali):  9.26 seconds

Using CPUs with calls to BLAS (no OpenCL): 6.99 seconds

So here despite optimization effort, it is still quicker to use the CPU to do the processing.

Just checking to see if I did anything wrong, I checked the provided sgemm.cpp file.  As a comparison, I ran the BLAS CPU version of sgemm (installed using apt-get) against the provided OpenCL example.  For the default matrix size of (2048x2048),  I get the following run times:

OpenCL sgemm: 84 seconds

CPU BLAS sgemm: 3 seconds

So the single-core CPU version is about 30 times faster than the OpenCL version!

I've checked, and the output is the same for both cases.

I suspect that performance is being choked in the OpenCL code by the sheer number of loads from global memory.  On systems with local memory, this data can be cached there so performance doesn't choke.  I assume the CPU is caching data fairly well and speeding this up.

If this is indeed the case, performance will be bad on anything other than extremely simple kernels.

Am I making some sort of mistake here? Is there anything I can do to mitigate this?  Essentially 90% of the processing time in most of my stuff is a matrix multiply.

Thanks for all of your help.

--Mike

  • Hi Mike,

    The slow performance appears to be related to DVFS, and is affecting all the SDK samples, not just sgemm.  You can disable DVFS as follows...

    echo off > /sys/class/misc/mali0/device/dvfs

    I'm seeing the time for sgemm reduce from around 84s to just under 10.  Can you try this and confirm you are seeing a similar speed-up?  There is clearly an issue with this within the BSP, which is something we will have to investigate further.  In the meantime, the above is a relatively simple workaround.

    In your question you said you were measuring the CPU BLAS sgemm at around 3s, so the CPU is still around 3x faster than the GPU.  It could be that this particular algorithm is better suited to the CPU.  What is likely taking the time is the relative inefficiency of the column reads, and a single or dual-threaded approach as seen on the CPU may well suffer less from poor cache efficiency than the thousands-of-threads model on the GPU.  One option is to split the sgemm function into two stages: the first does a transpose, whilst the second does the calculation, which can now use more cache-friendly linear reads.  The transpose could be done on the CPU, handing over to the GPU when done - or with some thought possibly even part-way through.  On the back of your question we are considering some further analysis along these lines.  Whether this is faster overall is of course hard to tell without trying it - which is often the case with CL development.
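As a rough illustration of the two-stage idea, here is a minimal sketch in plain C rather than OpenCL, so it only shows the access pattern and is not the SDK's sgemm kernel. The function names (`transpose`, `multiply_transposed`, `multiply_naive`, `sgemm_two_stage`) are made up for this example, and the matrices are assumed to be square, row-major and n x n. After the transpose, the inner loop reads both operands with unit stride instead of walking down a column of B.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stage 1: write the transpose of the n x n row-major matrix src into dst. */
void transpose(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

/* Stage 2: C = A * B, where bt holds the transpose of B.
 * Both a[] and bt[] are read linearly (stride 1) in the inner loop. */
void multiply_transposed(const float *a, const float *bt, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
}

/* Reference version for comparison: reads B down its columns (stride n). */
void multiply_naive(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Hypothetical wrapper: transpose B into a scratch buffer, then multiply.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int sgemm_two_stage(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    float *bt = malloc(n * n * sizeof *bt);
    if (bt == NULL)
        return -1;
    transpose(b, bt, n);
    multiply_transposed(a, bt, c, n);
    free(bt);
    return 0;
}
```

In an OpenCL version the transpose could run on the CPU or as its own kernel, with the second stage reading rows of both inputs; whether the extra pass pays for itself would need measuring on the device.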

    In defence of the GPU I should add a couple of things.  We have seen a number of relatively complex kernels that show a considerable speed-up when compared to the CPU equivalent.  What it comes down to is matching the algorithm to the strengths of the cores available to you, even going to the lengths of spreading different parts of the workload around the CPU and GPU cores.  Also do bear in mind that on the Chromebook the GPU is running at 533MHz whilst the CPUs run at 1.7GHz.  So in terms of performance per cycle, even with this "GPU-unfriendly" transposition it isn't doing too badly.  But possibly more important than that is the amount of power being used for the job.  The GPU will typically do the same work using much less energy.  On mobile platforms of course this is an important consideration.
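To put a rough number on the per-cycle comparison, the sketch below multiplies each run time by its clock rate. It uses the approximate figures quoted in this thread (about 10s at 533MHz for the GPU, about 3s at 1.7GHz for the CPU) and deliberately ignores how many cores each side has, so treat it as back-of-envelope arithmetic only. The helper names are made up for this example.

```c
#include <assert.h>

/* Clock cycles elapsed = run time in seconds * clock frequency in Hz. */
double cycles_elapsed(double seconds, double clock_hz)
{
    assert(seconds >= 0.0 && clock_hz > 0.0);
    return seconds * clock_hz;
}

/* Ratio of GPU cycles to CPU cycles for the same job.
 * A value near 1.0 means both took about the same number of clock cycles. */
double cycle_ratio(double gpu_s, double gpu_hz, double cpu_s, double cpu_hz)
{
    assert(cpu_s > 0.0);
    return cycles_elapsed(gpu_s, gpu_hz) / cycles_elapsed(cpu_s, cpu_hz);
}
```

With those figures, `cycle_ratio(10.0, 533e6, 3.0, 1.7e9)` works out to roughly 5.33e9 / 5.1e9, or about 1.05, so the two runs take a similar number of clock cycles even though the wall-clock times differ by about 3x.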

    Let me know how the above change affects you.  We'll update this thread with any further analysis of the sgemm kernel.

    Best regards,
    Tim
